The ISIS Project

This semi-annual status report covers activities of the ISIS project during the 
second half of 1989. Because this is our second progress report under NASA
funding, we assume that the reader has some background regarding the goals 
and status of our effort, and focus instead on technical accomplishments during 
the report period and goals for the next six months. In addition to listing our 
most recent publications, we also cite several more general publications at the 
end of this report. Readers unfamiliar with our work should probably start by 
reading our previous semi-annual report (January 1990) and perhaps some of 
these cited papers. 


Goals of the ISIS Effort During the Report Period 

During the first six months of 1990, our project had several independent objec- 
tives: 

1. At the level of the ISIS Toolkit, we undertook to complete ISIS release 
V2.0, containing our “bypass” communication protocols [1]. This effort 
was successful, and V2.0 has been released with a preliminary implemen- 
tation of the bypass protocols. Performance of the system is greatly en- 
hanced by this change, but the initial software release is limited in some 
respects. We are excited by this progress, and have now found several 
additional ways to extend the protocol suite in the most common client- 
server settings that arise under ISIS. We plan to make these changes to 
our system during the coming six months. With these changes, the by- 
pass mechanisms will have largely replaced all other protocol implemen- 
tations in ISIS, and the system should be both more scalable and capable 
of accommodating communication transport protocols based on hardware 
(ethernet and FDDI) multicast. With respect to system releases, we are 
close to a V2.1 release consisting of V2.0 with bug fixes, but are undecided 
as to when these other extensions would be released to the public. 

2. The Meta project focused on the definition of the Lomita programming 
language during the report period. Lomita is a high level language for 
specifying rules that monitor sensors for conditions of interest and trigger
appropriate reactions. This design has now been completed, and
implementation of Lomita is underway on the Meta 2.0 platform (this
defines a database interface to system instrumentation and is already
operational).

3. The Deceit file system effort completed a prototype, which is now oper- 
ational at Cornell. Our current plans are to make Deceit available on a 
limited basis, primarily for use in two hospital information systems with
which we are collaborating. 

4. A long-haul communication subsystem project was completed and can 
now be used as part of ISIS. This effort resulted in tools for linking ISIS 
systems on different LANs together over long-haul communications lines. 

5. Several papers were completed during the report period and are described 
below. 

6. Visitor Robbert Van Renesse (a recent graduate of the Amoeba project
of Vrije University, Amsterdam) developed Magic Lantern, a graphical 
tool for building application monitoring and control interfaces. Magic is 
being included as part of our general ISIS releases. Van Renesse is now 
spending six months at Bell Laboratories, but we are hoping that he will 
subsequently return to our group on a permanent basis. 

Goals of the ISIS Effort During the Remainder of 1990 

Looking to the remaining six months of 1990, our project has identified the 
following objectives: 


1. ISIS Toolkit: This group sees a near-term need to implement the so-called 
“pg_client” extensions to the bypass mode protocols and to complete the
implementation of our hierarchical process-group tools. 

2. We have become interested in developing a general purpose ISIS-based
system resource manager. Such a tool would schedule tasks onto machines 
in a network setting, providing a low-level programmable interface that 
could be specialized for a wide variety of uses. Tools of this sort are easily 
built under ISIS and we see little difficulty in this undertaking, which 
should be of substantial value within the ISIS user community. 

3. We are starting work on the design of a new version of the ISIS system,
ISIS++, which would run under the Mach and Chorus kernels. Goals would be
increased transparency, the ability to support a realtime toolkit, improved
object-oriented interfaces, and the possibility of exploiting shared memory
and higher performance multicast facilities. The current UNIX-based toolkit
would still be supported on top of ISIS++.

4. We hope to achieve a working version of Lomita during this period. 

5. We expect to see a working “production” version of Deceit completed 
during this period and in use at some number of Beta-test sites. 
6. We hope to complete papers on several topics: Experience with the ISIS
Toolkit, Part I: Issues of scope and scale (Birman and Cooper), Expe- 
rience with the ISIS Toolkit, Part II: Programming with process groups 
(Birman and Cooper), Theory of failure detection (Birman and Ricciardi), 
Long-haul communication programming methodology (Birman, Makpan- 
gou, Stephenson). 

7. An effort to develop a distributed version of ML over ISIS has attracted 
the interest of Robert Cooper, and we are actively involved in this effort. 
Our hope is that it may yield insight into better programming tools for 
use within the system. 

Publications 

Appendix A contains a copy of our papers on Distributed Application Man- 
agement and on the BYPASS communications protocols. As noted above, a 
substantial number of papers are now in the pipeline, including one on scaling 
through the use of hierarchical mechanisms throughout ISIS, and another on 
our experience programming with process groups. New topics arise constantly. 
We expect to release a number of these in technical report form during the fall 
of 1990. 

The following papers give information on ISIS and META for readers unfamiliar 
with our work. Reprints are available from Cornell. Our project has released a 
much larger number of papers. E-mail reprint requests to “isis@cs.cornell.edu”
or contact the project administrative aide at 607-255-9198. 

1. ISIS USERS MANUAL. Kenneth P. Birman, ed. Cornell University Department
of Computer Science.

This programmer's manual discusses the interface presented to ISIS users
who program in C, Lisp or Fortran. The current version of the manual
covers ISIS V1.3; an extensively revised version is planned for March 1990,
and will include discussion of architectural issues that arise in mapping
large applications to the ISIS system.

2. Kenneth P. Birman and Tommy Joseph. Chapters 13, 14 in: Distributed
Systems, Sape J. Mullender, ed., Addison-Wesley ACM Press Series, ISBN
41660, 1989.

This textbook was compiled from the lecture notes used in Arctic 88 and
Fingerlakes 89, advanced courses in distributed computing. Two chapters
cover the ISIS approach in considerable detail and represent a good technical
introduction to our work. The conclusions chapter may also be of
interest to readers; it explores the general question of robustness in
distributed systems.
The following list identifies publications and papers that will be released in the
near future. All of these papers are intended for eventual publication in journals 
or conferences. 

1. Kenneth P. Birman, Andre Schiper, Pat Stephenson. “Fast Causal Multicast”.
April 1990. Available as Cornell University Department of Computer
Science Technical Report 90-1105.

A new scheme is presented for efficiently implementing a reliable, causally
ordered multicast primitive. Intended for use in the ISIS toolkit, it offers
a way to bypass the most costly aspects of ISIS while benefiting from virtual
synchrony. The facility scales with bounded overhead. Performance
is extremely good over a range of reliability and delivery ordering properties;
with these new protocols, an application pays for the properties it
needs. Moreover, users can plug in new protocols and benefit from them in
the context of the remainder of the ISIS runtime environment. Speedups
of more than an order of magnitude were obtained when the scheme was
implemented within Isis. One conclusion is that systems built using ISIS
can achieve performance competitive with the best existing multicast
facilities - a finding contradicting the widespread concern that fault-tolerance
may be unacceptably costly. All protocols described in the paper have been
implemented and instrumented, and the code is available in ISIS V2.0.

2. Ken Birman and Robert Cooper. “The ISIS Project: Real Experience with a
Fault Tolerant Programming System”. July 1990. Available as Cornell 
University Department of Computer Science Technical Report 90-1138. 

The ISIS Project has developed a distributed programming toolkit and a 
collection of higher level applications based on these tools. ISIS is now in 
use at more than 300 locations world-wide. Here, we discuss the lessons 
(and surprises) gained from this experience with the real world. 

3. Keith Marzullo. “MTP: An Atomic Multicast Transport Protocol”. July
1990. Available as Cornell University Department of Computer Science 
Technical Report 90-1141. 

This paper describes MTP: a reliable transport protocol that utilizes the 
multicast strategy of applicable lower layer network architectures. In addi- 
tion to transporting data reliably and efficiently, MTP provides the client 
synchronization necessary for agreement on the receipt of data and the 
joining of the group of communicants.

4. Ken Birman and Aleta Ricciardi. “A Formalism for Fault-Tolerant Ap- 
plications”. July 1990. Available from Cornell University Department of 
Computer Science. 

Formal methods for specifying fault-tolerant requirements are extremely
important for proving a given application correct and robust. Formal
methods provide a clear and specific description and can lead to a better
understanding of the problem. Moreover, a formalism enables one to
quantify problems and solutions and compare their strengths. By quantifying
both, minimal solutions can be fitted to problems. We believe modal
logics, derived from the model of computation, provide an accessible and
useful means of specifying and reasoning about fault-tolerant requirements
for asynchronous systems. We developed a tense-epistemic logic based on a
model of computation whose basic semantic entities are the consistent cuts
of an asynchronous run. We use it to specify the safety and liveness properties
of the Group Membership Problem in asynchronous systems that can
experience crash failures. Phrased in terms of knowledge, these conditions
quantify solutions in that there is a correspondence between message complexity
and the level of knowledge attained. We use this correspondence to
derive two solutions to the Group Membership Problem.

5. Keith Marzullo, Robert Cooper, Mark Wood, Ken Birman. “Tools for 
Distributed Application Management”. June 1990. Available as Cornell 
University Department of Computer Science Technical Report 90-1136. 

It is common to construct software systems by interconnecting non-distributed
components, using remote procedure calls and stream communication channels.
This paper examines the problem of distributed application management
as it arises in systems having this structure. Our discussion is based
on the META system: a collection of utilities and tools for constructing
distributed application management software. Built using the ISIS toolkit,
these include facilities for monitoring and scheduling activity in an underlying
system, for dynamically reconfiguring in response to failures or
load changes and for automatically restarting failed system components.

A key facility is a sensor/actuator interface supporting a fault-tolerant
database of realtime sensor values that can be interrogated using a realtime
interval-logic query language. The set of sensors is completely extensible
and may include such values as machine load, temperature readings
from a thermometer, and even values of dynamically updated variables in
the memory space of a user-process. The META system is available from
Cornell University as part of its ISIS system distribution.

6. Messac Makpangou, Kenneth P. Birman, Pat Stephenson. “Designing 
Partitioned Wide-Area Applications”. Cornell University Department of 
Computer Science Technical Report, delayed (Sept. 1990). 

This technical report describes the new ISIS facilities for interconnecting
systems running on physically separated local area networks. The facility
assumes that links will normally be down and are open only periodically; it
spools communication automatically and transmits in bursts when the opportunity
arises. By examining the needs of typical wide area applications
(scheduling and a replicated directory) an argument is made that these
facilities will be adequate for solving a wide variety of wide-area problems.
The facility has been fully implemented and instrumented; it is included
as part of ISIS V2.0.

7. Kenneth P. Birman, Robert Cooper, Keith Marzullo. “ISIS and META 
Projects: Progress Report”. March 1990. Available as Cornell University 
Department of Computer Science Technical Report 90-1103. 

ISIS and META are two distributed systems projects at Cornell University. 
The ISIS project, led by Ken Birman, has developed a new methodology,
virtual synchrony, for writing robust distributed software.

8. Keith Marzullo and Mark Wood. “Making Real-Time Reactive Systems 
Reliable”. March 1990. Available from Cornell University Department of 
Computer Science. 

A reactive system is characterized by a control program that interacts with 
an environment (or controlled program). The control program monitors 
the environment and reacts to significant events by sending commands 
to the environment. This structure is quite general. Not only are most
embedded real-time systems reactive systems, but so are monitoring and 
debugging systems and distributed application management systems. Since 
reactive systems are usually long-running and may control physical equip- 
ment, fault-tolerance is vital. Our research tries to understand the princi- 
pal issues of fault-tolerance in real-time reactive systems and to build tools 
that allow a programmer to design reliable, real-time reactive systems. 

9. Keith Marzullo, O. Babaoglu and Fred Schneider. “Priority Inversion and 
Its Prevention”. February 1990. Available as Cornell University Depart- 
ment of Computer Science Technical Report 90-1088. 

A priority inversion occurs when a low-priority task causes execution of 
a higher-priority task to be delayed. The possibility of priority inversion 
complicates the analysis of systems that use priority-based schedulers because
priority inversions invalidate the assumption that a task can be
delayed only by higher-priority tasks. This paper formalizes priority inversion
and gives sufficient conditions as well as some new protocols for
preventing priority inversions.
ISIS Publications List

83-552 ISIS: An Environment for Constructing Fault-Tolerant Distributed Systems.
Birman, Skeen, El Abbadi, Dietrich and Raeuchle. May 1983.

84-594 Implementing Fault-Tolerant Distributed Objects. Birman, Joseph, Raeuchle, and
El Abbadi. 4th Symposium on Reliability in Distributed Systems and Database Systems,
Silver Springs, MD, October 1984. Available as reprint: IEEE Transactions
on Software Engineering, SE-11, 6, June 1985, Pgs. 502-508.

84-642 An Overview of the ISIS Project, Birman, El Abbadi, Dietrich, Joseph and
Raeuchle. October 1984. IEEE Distributed Processing Technical Committee
Newsletter. January 1985.

84-644 Low Cost Management of Replicated Data in Fault-Tolerant Distributed Systems.
Birman and Joseph. October 1984. Available as reprint: ACM Transactions on
Computer Systems, 4, 1, February 1986, Pgs. 54-70.

85-668 Replication and Fault-Tolerance in the ISIS System, Birman. March 1985 (Revised
September 1985). 10th ACM Symposium on Operating Systems Principles, December
1985, 79-86. Also appearing as: Operating Systems Review, 19, 5, December
1985.

85-694 Reliable Communication in the Presence of Failures. Birman and Joseph. July 1985.
(Revised August 1986). Available as reprint: ACM Transactions on Computer
Systems, 5, 1, February 1987, Pgs. 47-76.

85-712 Low Cost Management of Replicated Data. Joseph. (Ph.D. Thesis). November
1985.

86-744 ISIS: A System for Fault-Tolerance in Distributed Systems. Birman. April 1986.

86-753 Communication Support for Reliable Distributed Computing. Birman and Joseph.
May 1986. Proc. Asilomar Workshop on Fault Tolerant Distributed Computing,
March 1986.

86-772 Programming with Shared Bulletin Boards in Asynchronous Distributed Systems.
Birman, Joseph and Stephenson. August 1986. (Revised December 1986).

86-781 Efficient Concurrency Control for Libraries of Typed Objects. Raeuchle. (Ph.D.
Thesis). September 1986.

87-811 Exploiting Virtual Synchrony in Distributed Systems. Birman and Joseph. February
1987. 11th ACM Symposium on Operating Systems Principles, December 1987.
Also appearing as: Operating Systems Review, 22, 1, December 1987, 123-38.

87-849 ISIS - A Distributed Programming Environment, Version 1.3 - User’s Guide and
Reference Manual, Birman, Joseph and Schmuck. July 1987.

88-917 Exploiting Replication. Birman and Joseph. June 1988.

88-918 Reliable Broadcast Protocols. Joseph and Birman. June 1988.

88-928 The Use of Efficient Broadcast Protocols in Asynchronous Distributed Systems.
Schmuck. (Ph.D. Thesis). August 1988.

88-949 Causally Consistent Recovery of Partially Replicated Logs. Birman and Kane.
November 1988. Submitted for publication.

89-996 Concurrency Control for Transactions with Priorities. Marzullo. May 1989.

89-997 Implementing Fault-Tolerant Sensors. Marzullo. May 1989. Submitted for
publication.

The ISIS Distributed Programming Toolkit and The Meta Distributed Operating
System. Birman and Marzullo. SUN Technology, 2, 1 (Summer 1989).

89-1001 The Role of Order in Distributed Programs. Birman and Marzullo. March 1989.
Submitted for publication.

89-1014 An Advanced Course on Distributed Systems. Lecture notes from Arctic '88, an
advanced course on distributed systems. Addison-Wesley, 1989. To be published.

Supporting Large Scale Applications on Networks of Workstations, Cooper and
Birman, April 1989.

89-1042 Deceit: A Flexible Distributed File System. Siegel, Birman and Marzullo.
November 1989.

89-1067 Log-Based Recovery in Asynchronous Distributed Systems. Kenneth Kane.
December 1989.

90-1103 ISIS and Meta Projects: Progress Report. Birman, Cooper and Marzullo. February
1990.

90-1105 Fast Causal Multicast. Birman, Schiper and Stephenson. March 1990.

90-1136 Tools for Distributed Application Management. K. Marzullo, R. Cooper, M. Wood,
K. Birman. June 1990.

90-1138 The ISIS Project: Real Experience with a Fault Tolerant Programming System. K.
Birman and R. Cooper. July 1990.
Budget

A budgetary summary for the report period has been submitted directly to Maj. 
Boesch using the DARPA electronic reporting format, with a copy to Jerry Yan 
at NASA Ames. Expenditures are in line with projections under our current 
operating budget. 
Kenneth Birman

Department of Computer Science, Cornell University 
Andre Schiper 

Ecole Polytechnique Fédérale de Lausanne, Switzerland
Pat Stephenson 

Department of Computer Science, Cornell University 
April 10, 1990 


Abstract 

A new protocol is presented that efficiently implements a reliable, causally or- 
dered multicast primitive and is easily extended into a totally ordered one. Intended 
for use in the Isis toolkit, it offers a way to bypass the most costly aspects of Isis 
while benefiting from virtual synchrony. The facility scales with bounded overhead. 
Measured speedups of more than an order of magnitude were obtained when the pro- 
tocol was implemented within Isis. One conclusion is that systems such as Isis can 
achieve performance competitive with the best existing multicast facilities - a finding 
contradicting the widespread concern that fault-tolerance may be unacceptably costly. 

Keywords and phrases: Distributed computing, fault-tolerance, process groups,
reliable multicast, ABCAST, CBCAST, Isis. 
1 Introduction


The Isis Toolkit [BJKS88] provides a variety of tools for building software in loosely 
coupled distributed environments. The system has been successful in addressing problems 
of distributed consistency, cooperative distributed algorithms, and fault-tolerance. At the 
time of this writing, ISIS was in use at more than 250 locations worldwide. 

Two aspects of Isis are key to its overall approach: 

• An implementation of virtually synchronous process groups. 

• A collection of atomic multicast protocols with which processes and group members 
interact with groups. 

Although Isis supports a wide range of multicast protocols, a protocol called CBCAST 
accounts for the majority of communication in the system; in fact, many of the Isis tools 
are little more than invocations of this communication primitive. For example, the Isis
replicated data tool uses a single (asynchronous) CBCAST to perform each update and 
locking operation; reads require no communication at all. A consequence is that the cost 
of CBCAST represents the dominant performance bottleneck in the Isis system. 

The initial Isis CBCAST protocol was costly in part for structural reasons, and in part 
because of the protocol used. The implementation was within a protocol server, hence all 
CBCAST communication was via an indirect path. Independent of the cost of the protocol
itself, this indirection was tremendously expensive. With respect to the protocol used,
our initial implementation favored generality over specialization, permitting extremely 
flexible destination addressing, and using a piggybacking mechanism that achieved a de- 
sired ordering property but required a garbage collection mechanism. On the other hand, 
this structure seemed to be the only one capable of supporting a powerful, general set of 
programming tools like the ones in our toolkit: simpler protocols often simply overlook 
critical forms of functionality, which may explain why so few have entered widespread 
use. Particularly valuable to us has been the ability to support multiple, possibly
overlapping process groups, and virtual synchrony [BJKS88]. 

The protocol we present here is based on a causal ordering protocol originally developed by 
Schiper [SES89]. Unlike our previous work, it assumes a preexisting virtually synchronous
programming environment like the one that Isis provides, although using few of its fea- 
tures. Further, it supports a relatively restricted form of multicast addressing. Were our 
work done outside of the context of Isis, this would seriously limit its generality. In our 
implementation, however, messages that do not conform to these restrictions are simply 
routed via the old, more costly algorithm. A highly optimized multicast protocol results 
that bypasses the old Isis system and imposes very little overhead beyond that of the 
message transport layer. The majority of Isis communication satisfies the requirements 
of the bypass protocols and hence benefits from our work. 

Our protocol uses a timestamping scheme, and in this respect resembles prior work by 
Ladkin [LL86] and Peterson [PBS89]. However, our results are substantially more general.
The most important differences are these: 

• Peterson’s Psync-based protocol can be used only in systems composed of a single
process group; ours supports multiple, possibly overlapping process groups.

• Both Peterson’s and Ladkin’s protocols have overhead linear in the number of pro- 
cesses that ever participated in the application, which could be large; our overhead 
is bounded and small. 

Like Peterson’s and Ladkin’s protocols, our basic protocol provides for message delivery 
ordering that respects causality in the sender (CBCAST), but is readily extended into a 
more costly protocol that provides a total delivery ordering even for concurrent invocations 
(ABCAST). 

The bypass protocol suite lets users select the multicast properties desired for an appli- 
cation. Choices include a “raw” delivery service achieving extremely high performance 
but with minimal reliability guarantees, multicast with atomicity and FIFO delivery, and
causal or total ordering. This approach permits the user to pay for just those reliability
and ordering properties needed by the application. 

The paper is structured as follows. Section 2 reviews the multicasting problem and defines 
our terminology. Sections 3 and 4 introduce our new technique. Section 5 discusses
extensions of the CBCAST protocol, including the bypass ABCAST protocol. The 
costs of our various primitives are measured in Section 6. 


2 Execution model 

2.1 Basic system model 

The system is composed of processes $P = \{p_1, p_2, \ldots, p_n\}$ with disjoint memory spaces.
Initially, we assume that this set is static and known in advance; later we relax this 
assumption. Processes fail by crashing detectably (a fail-stop assumption); notification is 
provided by Isis in a manner described below. In many situations, processes will need 
to cooperate. For this purpose, they form process groups. Each such group has a name 
and a set of member processes; members join and leave dynamically; a failure causes a 
departure from all groups to which a process belongs. The members of a process group 
need not be identical, nor is there any limit on the number of groups to which a process 
may belong. The set of groups is denoted by $G = \{g_1, g_2, \ldots\}$. In typical settings, the
number of groups will be large and processes will belong to several groups. 

Our system model is unusual in assuming an external service that implements the pro- 
cess group abstraction. The interface from a process to this service will not concern us 
here, but the manner in which the service communicates to a process is highly relevant. 
A view of a process group is a list of its members. A view sequence for $g$ is a list
$view_0(g), view_1(g), \ldots, view_n(g)$, where

1. $view_0(g) = \emptyset$.

2. $\forall i : view_i(g) \subseteq P$, where $P$ is the set of all processes in the system.

3. $view_i(g)$ and $view_{i+1}(g)$ differ by the addition or subtraction of exactly one process.
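
As an illustration only, the three conditions on a view sequence can be checked mechanically. The sketch below assumes views are modelled as sets of process names and reads the conditions as: the first view is empty, every view is a subset of P, and consecutive views differ by exactly one process. The function name `is_valid_view_sequence` and the process names are hypothetical and are not part of Isis.

```python
from typing import FrozenSet, Sequence


def is_valid_view_sequence(
    views: Sequence[FrozenSet[str]], all_processes: FrozenSet[str]
) -> bool:
    """Check the three view-sequence conditions for a single group."""
    # Condition 1: the sequence starts from the empty view.
    if not views or views[0] != frozenset():
        return False
    # Condition 2: every view contains only processes of the system.
    if any(not view <= all_processes for view in views):
        return False
    # Condition 3: consecutive views differ by exactly one join or departure.
    # The symmetric difference holds the processes present in only one view.
    return all(len(a ^ b) == 1 for a, b in zip(views, views[1:]))


P = frozenset({"p1", "p2", "p3"})
history = [
    frozenset(),
    frozenset({"p1"}),
    frozenset({"p1", "p2"}),
    frozenset({"p2"}),  # p1 has left or failed
]
print(is_valid_view_sequence(history, P))
```

A sequence in which two processes join in a single step, or in which a view names a process outside P, is rejected by this check.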

We assume that some sort of process group service computes new views and communicates 
them to the members of the groups involved. Processes learn of the failure of other group 
members only through this view mechanism, never through any sort of direct observation. 

We assume that direct communication between processes is always possible; the software 
implementing this is called the message transport layer. Within our protocols, processes 
always communicate using point-to-point and multicast messages; the latter may be transmitted
using multiple point-to-point messages if no more efficient alternative is available.
The transport communication primitives must provide lossless, uncorrupted, sequenced
message delivery. Our approach permits application builders to define new transport protocols,
perhaps to take advantage of special hardware. Our initial implementation uses
unreliable datagrams, but has an experimental protocol that exploits ethernet hardware
multicast. 

The execution of a process is a partially ordered sequence of events, each corresponding 
to the execution of an indivisible action. An acyclic event order, denoted $\stackrel{p}{\rightarrow}$, reflects
the dependence of events occurring at process $p$ upon one another. The event $send_p(m)$
denotes the transmission of $m$ by process $p$ to a set of 1 or more destinations $dests(m)$; the
receive event is denoted $rcv_p(m)$. We omit the subscript when the context is unambiguous.
If $|dests(m)| > 1$ we will assume that send puts messages into all communication channels
in a single action that might be interrupted by failure, but not by other send or rcv actions.
We denote by $rcv_p(view_i(g))$ the event by which a process $p$ belonging to $g$ “learns” of
$view_i(g)$.

We distinguish the event of receiving a message from the event of delivery, since this allows
us to model protocols that delay message delivery until some condition is satisfied. The
delivery event is denoted $deliver(m)$ where $rcv(m) \stackrel{p}{\rightarrow} deliver(m)$.
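
The distinction between receiving and delivering can be sketched as a small buffer that holds received messages until a delivery condition holds. This is a minimal illustration, not the Isis implementation: the class `DelayedDelivery`, the predicate `in_sequence`, and the use of integers as messages are all assumptions made for the example.

```python
from typing import Callable, List


class DelayedDelivery:
    """Buffers received messages until a delivery condition is satisfied."""

    def __init__(self, can_deliver: Callable[[int, List[int]], bool]) -> None:
        self.can_deliver = can_deliver
        self.pending: List[int] = []    # received but not yet delivered
        self.delivered: List[int] = []  # delivered, in delivery order

    def rcv(self, m: int) -> None:
        """The rcv event: the message arrives and is queued."""
        self.pending.append(m)
        self._deliver_ready()

    def _deliver_ready(self) -> None:
        """The deliver events: release every message whose condition holds."""
        progress = True
        while progress:
            progress = False
            for m in list(self.pending):
                if self.can_deliver(m, self.delivered):
                    self.pending.remove(m)
                    self.delivered.append(m)
                    progress = True


def in_sequence(m: int, delivered: List[int]) -> bool:
    # Hypothetical condition: message k is deliverable once 0..k-1 are delivered.
    return m == len(delivered)


node = DelayedDelivery(in_sequence)
node.rcv(1)  # arrives early, so it is held back
node.rcv(0)  # unblocks message 1 as well
print(node.delivered)
```

Each message is received before it is delivered, which matches the ordering between rcv(m) and deliver(m) stated above; the timestamp-based protocols later in the paper supply a different delivery condition.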

2.2 Properties required of multicast protocols 

Although Isis makes heavy use of virtual synchrony, it will not be necessary to formalize 
this property for our present discussion. However, the support of virtual synchrony places 
several obligations on the processes in our system. First, when a process multicasts a 
message m to group g, dests(m) must be the current membership of g. Secondly, when 
the group view changes, all messages sent in the prior view must be “flushed” out of the 
system (delivered) before the new view may be used. Finally, messages must satisfy a 
failure atomicity property: if a message m is delivered to any member of a group, and it
stays operational, m must be delivered to all members of the group even if the sender fails
before completing the transmission. 

The multicast protocols that interest us here also provide delivery ordering guarantees. As 
in [Lam78], we define the potential causality relation for the system, $\rightarrow$, as the transitive
closure of the relation defined as follows:

1. If $\exists p : e \stackrel{p}{\rightarrow} e'$, then $e \rightarrow e'$.

2. $\forall m : send(m) \rightarrow rcv(m)$.
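
For small recorded executions, the potential causality relation can be computed directly from its definition. The following sketch makes two simplifying assumptions that are not in the text: each process's events are given as a totally ordered trace (the text allows a partial order per process), and messages carry unique string identifiers. The function name `potential_causality` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[str, int]   # (process name, index in that process's trace)
Action = Tuple[str, str]  # ("send" or "rcv", message id)


def potential_causality(
    traces: Dict[str, List[Action]]
) -> Set[Tuple[Event, Event]]:
    """Return all pairs (e, e2) such that e potentially causes e2."""
    edges: Set[Tuple[Event, Event]] = set()
    send_event: Dict[str, Event] = {}
    for p, actions in traces.items():
        for i, (kind, m) in enumerate(actions):
            if i > 0:
                # Rule 1: local order at p (assumed total in this sketch).
                edges.add(((p, i - 1), (p, i)))
            if kind == "send":
                send_event[m] = (p, i)
    for p, actions in traces.items():
        for i, (kind, m) in enumerate(actions):
            if kind == "rcv" and m in send_event:
                # Rule 2: send(m) precedes every rcv(m).
                edges.add((send_event[m], (p, i)))
    # Transitive closure (Warshall's algorithm; fine for small traces).
    nodes = {(p, i) for p, acts in traces.items() for i in range(len(acts))}
    closure = set(edges)
    for k in nodes:
        for a in nodes:
            if (a, k) in closure:
                for b in nodes:
                    if (k, b) in closure:
                        closure.add((a, b))
    return closure


traces = {
    "p": [("send", "m1")],
    "q": [("rcv", "m1"), ("send", "m2")],
    "r": [("rcv", "m2")],
}
relation = potential_causality(traces)
print((("p", 0), ("r", 0)) in relation)
```

In this trace, p's send of m1 potentially causes r's receipt of m2, even though p and r never communicate directly; the closure captures that chain through q.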

CBCAST satisfies a causal delivery property: 

If $m$ and $m'$ are CBCAST’s and $send(m) \rightarrow send(m')$ then

$\forall p \in dests(m) \cap dests(m') : deliver(m) \stackrel{p}{\rightarrow} deliver(m')$.

If two CBCAST messages are concurrent, the protocol places no constraints on their 
delivery ordering at overlapping destinations. 
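
The causal delivery property can be tested against a recorded run. The sketch below is an assumption-laden checker, not part of the protocol: it takes as given the set of message pairs whose sends are causally related, the destination set of each message, and the delivery order observed at each process. The name `causal_delivery_violations` and the data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def causal_delivery_violations(
    sent_before: Set[Tuple[str, str]],
    dests: Dict[str, Set[str]],
    delivered: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Return (process, m, m2) triples where m2 was delivered without m first.

    sent_before holds pairs (m, m2) with send(m) causally before send(m2).
    """
    violations = []
    for m, m2 in sorted(sent_before):
        # Only destinations common to both messages are constrained.
        for p in sorted(dests[m] & dests[m2]):
            order = delivered.get(p, [])
            if m2 not in order:
                continue  # nothing has been delivered out of order yet
            if m not in order or order.index(m) > order.index(m2):
                violations.append((p, m, m2))
    return violations


dests = {"a": {"p", "q"}, "b": {"p", "q"}}
run = {"p": ["a", "b"], "q": ["b", "a"]}
print(causal_delivery_violations({("a", "b")}, dests, run))
```

If the two messages are concurrent, the pair is simply absent from `sent_before`, and either delivery order passes the check, which reflects the freedom CBCAST leaves for concurrent messages.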

ABCAST extends the CBCAST ordering into a total one: 

If $m$ and $m'$ are ABCAST’s then either

1. $\forall p \in dests(m) \cap dests(m') : deliver(m) \stackrel{p}{\rightarrow} deliver(m')$, or

2. $\forall p \in dests(m) \cap dests(m') : deliver(m') \stackrel{p}{\rightarrow} deliver(m)$.
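
The total ordering condition says that all common destinations agree on the relative order of any two messages. A minimal checker for a recorded run might look as follows; the function name `is_totally_ordered` and the dictionary-based representation of destinations and delivery logs are assumptions for the example.

```python
from itertools import combinations
from typing import Dict, List, Set


def is_totally_ordered(
    dests: Dict[str, Set[str]], delivered: Dict[str, List[str]]
) -> bool:
    """True if every pair of messages is delivered in one agreed order."""
    for m, m2 in combinations(sorted(dests), 2):
        observed = set()
        for p in dests[m] & dests[m2]:
            order = delivered.get(p, [])
            if m in order and m2 in order:
                # Record whether p delivered m before m2.
                observed.add(order.index(m) < order.index(m2))
        if len(observed) > 1:
            return False  # two destinations disagree on this pair
    return True


dests = {"x": {"p", "q"}, "y": {"p", "q"}}
print(is_totally_ordered(dests, {"p": ["x", "y"], "q": ["x", "y"]}))
print(is_totally_ordered(dests, {"p": ["x", "y"], "q": ["y", "x"]}))
```

The second run would be acceptable under CBCAST if x and y were sent concurrently, but it fails this check, which is the extra guarantee ABCAST provides.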

Because the ABCAST protocol orders concurrent events, it is more costly than CBCAST,
requiring synchronous solutions where the CBCAST protocol admits efficient
asynchronous solutions. Birman and Joseph [BJ89] and Schmuck [Sch88] have exhibited a
large class of algorithms that can be implemented using asynchronous CBCAST. Moreover,
Schmuck has shown that in many settings algorithms specified in terms of ABCAST
can be modified to use CBCAST without compromising correctness.

The protocols presented here all assume that processes only multicast to groups that they 
are members of, and that all multicasts are to the full membership of a single group. 

For demonstrating liveness, we will assume that any message sent by a process is eventually 
received unless the sender or destination fails, and that failures are eventually reported 
by ISIS. 

3 The CBCAST bypass protocol 

This section presents two basic CBCAST protocols for use within a single process group 
with fixed membership. Both use timestamps to delay messages that arrive out of causal 
order. The section that follows extends these schemes and then merges them to obtain a 
single solution for use with multiple, dynamic process groups. 

3.1 Timestamping protocols

We begin by describing two protocols for assigning timestamps to messages and for com- 
paring timestamps. The protocols are standard except in one respect: whereas most 
timestamping protocols count arbitrary “events”, the ones defined here count only send 
events. 

3.2 Logical time 

The first timestamping protocol is based on one introduced by [Lam78], called the logical
clock protocol. Each process p maintains an unbounded local counter, LT(p), which it
initializes to zero. For each event send(m) at p, p sets LT(p) = LT(p) + 1. Messages
are timestamped with the sender's incremented counter. A process p receiving a message m
with timestamp LT(m) sets LT(p) = max(LT(p), LT(m)). As in [Lam78], one can show
that if $send(m) \rightarrow send(m')$ then LT(m) < LT(m'). The converse, however, does not hold:
the protocol may order messages that were sent concurrently.

Note that the LT counter for a process is updated at the rcv event, as opposed to the
deliver event, for an incoming message. We make use of this property in the development
below.
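
The logical time rules can be illustrated with a minimal Python sketch. The class and method names (`LogicalClock`, `on_send`, `on_receive`) are assumptions made for illustration and are not taken from the ISIS sources. The sketch shows that causally related sends receive increasing timestamps, and that concurrent sends may nevertheless be ordered.

```python
class LogicalClock:
    """Per-process counter LT(p) that counts send events only (illustrative)."""

    def __init__(self):
        self.value = 0  # LT(p) is initialized to zero

    def on_send(self):
        # Increment before sending; the message carries the new value.
        self.value += 1
        return self.value

    def on_receive(self, msg_time):
        # Updated at the rcv event (not at delivery): take the maximum.
        self.value = max(self.value, msg_time)


p, q, r = LogicalClock(), LogicalClock(), LogicalClock()
t_m1 = p.on_send()    # p sends m1
q.on_receive(t_m1)    # q receives m1 ...
t_m2 = q.on_send()    # ... and then sends m2, so send(m1) -> send(m2)
t_m3 = r.on_send()    # r sends m3 concurrently with m1 and m2
```

Here `t_m1 < t_m2` reflects causality, while `t_m3 < t_m2` holds even though m3 and m2 are concurrent, which is the imprecision noted above.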


3.3 Vector time 

A second timestamping protocol is based on the substitution of vector times for the local
counters in the logical time protocol. Vector times were proposed originally in [Mar84];
other researchers have also used them [Fid88,Mat89,LL86,Sch88]; our use of them is
motivated by a protocol presented in [SES89]. In comparison with logical times, this protocol
has the advantage of representing $\rightarrow$ precisely.

A vector time for a process p_i, denoted VT(p_i), is a vector of length n (where n = |P|),
indexed by process-id.

1. When p_i starts execution, VT(p_i) is initialized to zeros.

2. For each event send(m) at p_i, VT(p_i)[i] is incremented by 1.

3. Each message sent by process p_i is timestamped with the incremented value of
VT(p_i).

4. When process p_j delivers a message m from p_i containing VT(m), p_j modifies its
vector time in the following manner:

$\forall k \in 1..n : VT(p_j)[k] = \max(VT(p_j)[k], VT(m)[k])$

Rules for comparing vector times are: 
1. $VT_1 \leq VT_2$ iff $\forall i : VT_1[i] \leq VT_2[i]$

2. $VT_1 < VT_2$ if $VT_1 \leq VT_2$ and $\exists i : VT_1[i] < VT_2[i]$

Notice that in contrast to the rule for LT(p), VT(p) is updated at the deliver event for
an incoming message. We will make use of this distinction below.

It can be shown that given messages m and m', $send(m) \rightarrow send(m')$ iff VT(m) < VT(m')
[Mat89,Fid88]: vector timestamps represent causality precisely. This constitutes the
fundamental property of vector times, and the primary reason for our interest in such times
as opposed to logical ones.
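
As a hedged illustration of the vector time rules and the comparison rules, the following Python sketch uses 0-based process indices and hypothetical helper names (`vt_leq`, `vt_less`, `VectorClock`); none of these names come from the original system.

```python
def vt_leq(a, b):
    """VT_1 <= VT_2 iff every entry of a is <= the matching entry of b."""
    return all(x <= y for x, y in zip(a, b))


def vt_less(a, b):
    """VT_1 < VT_2 if VT_1 <= VT_2 and at least one entry is strictly smaller."""
    return vt_leq(a, b) and any(x < y for x, y in zip(a, b))


class VectorClock:
    def __init__(self, n, index):
        self.vt = [0] * n    # rule 1: initialized to zeros
        self.index = index   # this process's own position in the vector

    def on_send(self):
        self.vt[self.index] += 1   # rule 2: count the send event
        return list(self.vt)       # rule 3: timestamp carried by the message

    def on_deliver(self, msg_vt):
        # rule 4: entry-wise maximum, applied at the deliver event
        self.vt = [max(x, y) for x, y in zip(self.vt, msg_vt)]


p0, p1, p2 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
a = p0.on_send()     # [1, 0, 0]
p1.on_deliver(a)
b = p1.on_send()     # [1, 1, 0], causally after a
c = p2.on_send()     # [0, 0, 1], concurrent with both a and b
```

In this run `vt_less(a, b)` is true, while neither `vt_less(b, c)` nor `vt_less(c, b)` holds, so the concurrent pair is left unordered.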

3.4 Causal message delivery 

Recall that if processes communicate using CBCAST, all messages must be delivered in
an order consistent with causality. Suppose that a set of processes P communicate using
only broadcasts to the full set of processes in the system; that is, $\forall m : dests(m) = P$. This
hypothesis is unrealistic, but Section 4 will adapt the resulting protocol to a setting with
multiple process groups (see footnote 1). We now develop two delivery protocols by which
each process p receives messages sent to it, delays them if necessary, and then delivers them
such that:

If $send(m) \rightarrow send(m')$ then $deliver(m) \rightarrow deliver(m')$.

3.4.1 LT protocol 

Our first solution to the problem is based on logical clocks, and is referred to as the LT
protocol from here on. It is related to other solutions that have appeared in the literature
[Lam78,CASD86] and will be used as a building block later on. The basic technique will
be to delay a message until a message with at least as large a timestamp has been received
from every other process in the system. However, since this would only work if every
process sends an infinite stream of multicasts, a channel flushing mechanism is introduced
to avoid potentially unbounded delays.

Say that the channel from process p_j to p_i has been flushed at time LT(m) if p_i will
never receive a message m' from p_j with LT(m') < LT(m). Flushing can be achieved
by pinging. To ping a channel, p_i sends p_j a timestamped inquiry message inq, but
without first incrementing LT(p_i). On receiving an inquiry, p_j, as usual, sets LT(p_j) =
max(LT(p_j), LT(inq)) and replies with an ack message containing LT(p_j), without
modifying LT(p_j). On receiving the ack, p_i, as usual, sets LT(p_i) = max(LT(p_i), LT(ack)).
If no new messages are being multicast, pinging advances LT(p_i) and LT(p_j) to the same
value.

The protocol is as follows: 

1. Before sending message m, process p_i increments LT(p_i) and then timestamps m.

2. On receiving message m, process p_j sets LT(p_j) = max(LT(p_j), LT(m)). Then, p_j
delays m until, for all $k \neq i$, the channel between p_j and p_k has been flushed for time
LT(m). p_j does not delay messages received from itself.

3. If m has the minimum timestamp among messages satisfying (2), m may be delivered.

Footnote 1: This hypothesis is actually used only in the VT delivery protocol.

Consider two messages m_1 and m_2 such that $send(m_1) \rightarrow send(m_2)$,
and hence LT(m_1) < LT(m_2). There are two cases:

1. The same process sends m_1 and m_2. Because communication is FIFO, m_1 will be
received before m_2 and, because LT(m_1) < LT(m_2), condition 3 guarantees that
m_1 will be delivered before m_2.

2. Different processes send m_1 and m_2. According to condition 2, m_2 can only be
delivered once all channels have been flushed for LT(m_2). As communication is
FIFO and LT(m_1) < LT(m_2), it follows that m_1 has been received. Condition 3
then guarantees that m_1 will be delivered before m_2.
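
The receiver side of the LT protocol can be sketched as follows. This is a simplified Python illustration under stated assumptions: the names (`LTReceiver`, `on_message`, `on_timestamp`) are hypothetical, channels are assumed FIFO, and a channel is treated as flushed for LT(m) once a timestamp at least as large as LT(m) has been seen on it, which is a conservative reading of the definition. `on_timestamp` stands for the receipt of an ack produced by pinging.

```python
class LTReceiver:
    """Receiver-side state of the LT delivery rule (illustrative sketch)."""

    def __init__(self, me, processes):
        self.lt = 0
        # Highest timestamp seen so far on the channel from each other process.
        self.seen = {k: 0 for k in processes if k != me}
        self.pending = []     # (timestamp, sender, payload), not yet delivered
        self.delivered = []

    def on_timestamp(self, sender, ts):
        # An ack carries a timestamp but no payload.
        self.lt = max(self.lt, ts)
        self.seen[sender] = max(self.seen[sender], ts)
        self._try_deliver()

    def on_message(self, sender, ts, payload):
        self.lt = max(self.lt, ts)          # LT is updated at the rcv event
        self.seen[sender] = max(self.seen[sender], ts)
        self.pending.append((ts, sender, payload))
        self._try_deliver()

    def _flushed(self, ts, sender):
        # Condition 2: every channel other than the sender's is flushed for ts.
        return all(t >= ts for k, t in self.seen.items() if k != sender)

    def _try_deliver(self):
        # Condition 3: deliver in timestamp order, smallest first.
        self.pending.sort()
        while self.pending and self._flushed(self.pending[0][0], self.pending[0][1]):
            self.delivered.append(self.pending.pop(0)[2])


c = LTReceiver("c", ["a", "b", "c"])
c.on_message("b", 2, "m2")   # m2 was sent after b received m1; it arrives first
held_back = list(c.delivered)
c.on_message("a", 1, "m1")   # m1 arrives and can be delivered
c.on_timestamp("a", 2)       # ack from a flushes its channel for time 2
```

In this run m2 is held back until m1 has been delivered and the channel from a has been flushed for time 2.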

The communication cost, however, is high: 2n - 3 messages may be needed to flush
channels for every message delivered, hence to multicast one message, O(n^2) messages
could be transmitted. For infrequent multicasting, this cost may well be tolerable; the
overhead would be unacceptable if incurred frequently. However, in place of pinging,
processes can periodically multicast their logical timestamps to all other group members.
Receipt of such a multicast flushes the channels: at worst, a received message will be
delayed until the recipient has multicast its timestamp and all other processes have done
a subsequent timestamp multicast. The overhead of the protocol can now be tuned for a
given environment (see footnote 2).

Footnote 2: Readers familiar with the Δ-T real-time protocols of [CASD86] will note the
similarity between that protocol and this version of ours. Clock synchronization (on which
the Δ-T scheme is based) is normally [...] multicasts [ST87]. This modification recalls
suggestions made in [Lam78], and makes logical clocks behave like weakly synchronized
physical clocks. Clock synchronization algorithms with good message complexity are known,
hence substitution of a Δ-T based protocol for the logical clock based protocol in our
"combined" algorithm, below, is an intriguing direction for future study.


3.4.2 VT protocol


A much cheaper solution can be derived using vector timestamps; we will refer to this
as the VT protocol. The idea is basically the same as in the LT protocol, but because
VT(m)[k] indicates precisely how many multicasts by process p_k precede m, a recipient
of m will know precisely how long m must be delayed prior to delivery; namely, until
VT(m)[k] messages have been delivered from p_k. Since $\rightarrow$ is an acyclic order accurately
represented by the vector time, the resulting delivery order is causal and deadlock free.
The protocol is as follows:

1. Before sending m, process p_i increments VT(p_i)[i] and timestamps m.

2. On reception of message m sent by p_i and timestamped with VT(m), process $p_j \neq p_i$
delays m until

$VT(m)[i] = VT(p_j)[i] + 1$

$\forall k \neq i : VT(m)[k] \leq VT(p_j)[k]$

Process p_j need not delay messages received from itself.

3. When a message m is delivered, VT(p_j)[i] is incremented (this is simply the vector
time update protocol from Section 3.3).

Step 2 is the key to the protocol. This guarantees that any message m' transmitted
causally before m (and hence with VT(m') < VT(m)) will be delivered at p_j before m is
delivered. An example in which this rule is used to delay delivery of a message appears
in Figure 1.



Figure 1: Using the VT rule to delay message delivery (horizontal axis: time)
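
The three steps of the VT protocol can be sketched in a few lines of Python. This is an illustration for a single group with fixed membership, not the ISIS implementation; the class name `VTProcess`, the message tuple layout and the 0-based process indices are assumptions.

```python
class VTProcess:
    """One process running the VT delivery rule in a single fixed group."""

    def __init__(self, index, n):
        self.index = index
        self.vt = [0] * n
        self.pending = []     # received but delayed messages
        self.delivered = []   # payloads in delivery order

    def send(self, payload):
        # Step 1: increment own entry, then timestamp the message.
        self.vt[self.index] += 1
        return (self.index, list(self.vt), payload)

    def _deliverable(self, sender, vt):
        # Step 2: next message from the sender, and nothing it depends on is missing.
        return (vt[sender] == self.vt[sender] + 1 and
                all(vt[k] <= self.vt[k] for k in range(len(vt)) if k != sender))

    def receive(self, msg):
        if msg[0] == self.index:
            return  # messages received from itself are not delayed (or re-delivered)
        self.pending.append(msg)
        progress = True
        while progress:       # one delivery may unblock other delayed messages
            progress = False
            for m in list(self.pending):
                sender, vt, payload = m
                if self._deliverable(sender, vt):
                    self.pending.remove(m)
                    self.vt[sender] += 1   # step 3
                    self.delivered.append(payload)
                    progress = True


p0, p1, p2 = VTProcess(0, 3), VTProcess(1, 3), VTProcess(2, 3)
m1 = p0.send("m1")
p1.receive(m1)
m2 = p1.send("m2")          # send(m1) -> send(m2)
p2.receive(m2)              # arrives first at p2 and is delayed
delayed = list(p2.delivered)
p2.receive(m1)              # m1 is delivered, which releases m2
```

At p2 the message m2 carries VT(m2)[0] = 1 while VT(p2)[0] = 0, so it waits until m1 has been delivered.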

The correctness of the protocol will be proved in two stages. We first show that causality is 
never violated (safety) and then we demonstrate that the protocol never delays a message 
indefinitely (liveness). 

Safety. Consider the actions of a process p_j that receives two messages m_1 and m_2 such
that $send(m_1) \rightarrow send(m_2)$.

Case 1. m_1 and m_2 are both transmitted by the same process p_i. Recall that we
assumed a lossless, sequenced communication system, hence p_j receives m_1 before
m_2. By construction, VT(m_1) < VT(m_2), hence under step 2, m_2 can only be
delivered after m_1 has been delivered.

Case 2. m_1 and m_2 are transmitted by two distinct processes p_i and p_i'. We will show
by induction on the messages received by process p_j that m_2 cannot be delivered
before m_1. Assume that m_1 has not been delivered and that p_j has received k
messages.

Observe first that $send(m_1) \rightarrow send(m_2)$, hence VT(m_1) < VT(m_2) (basic property
of vector times). In particular, if we consider the field corresponding to process p_i,
the sender of m_1, we have

$VT(m_1)[i] \leq VT(m_2)[i]$    (1)

Base case. The first message delivered by p_j cannot be m_2. Recall that if no
messages have been delivered to p_j, then VT(p_j)[i] = 0. However, VT(m_1)[i] > 0
(because m_1 is sent by p_i), hence VT(m_2)[i] > 0. By application of step 2 of
the protocol, m_2 cannot be delivered by p_j.

Inductive step. Suppose p_j has received k messages, none of which is a message
m such that $send(m_1) \rightarrow send(m)$. If m_1 has not yet been delivered, then

$VT(p_j)[i] < VT(m_1)[i]$    (2)

This follows because the only way to assign a value to VT(p_j)[i] greater than
VT(m_1)[i] is to deliver a message from p_i that was sent subsequent to m_1, and
such a message would be causally dependent on m_1. From relations 1 and 2 it
follows that

$VT(p_j)[i] < VT(m_2)[i]$

By application of step 2 of the protocol, the (k+1)st message delivered by p_j
cannot be m_2. □

Liveness. Suppose that there exists a broadcast message m sent by process p_i that can
never be delivered to process p_j. Step 2 implies that either:

$VT(m)[i] \neq VT(p_j)[i] + 1$, or

$\exists k \neq i : VT(m)[k] > VT(p_j)[k]$

and that m was not transmitted by process p_j. We consider these cases in turn.

1. $VT(m)[i] \neq VT(p_j)[i] + 1$, that is, m is not the next message to be delivered
from p_i to p_j. Since all messages are multicast to all processes and channels are lossless
and sequenced, it follows that there must be some message m' sent by p_i that p_j
received previously, has not yet delivered, and with VT(m')[i] = VT(p_j)[i] + 1. If
m' is also delayed, it must be under the other case.

2. $\exists k \neq i : VT(m)[k] > VT(p_j)[k]$. Let n = VT(m)[k]. The n'th transmission of
process p_k must be some message $m' \rightarrow m$ that has either not been received at p_j,
or was received and is delayed. Under the hypothesis that all messages are sent
to all processes, m' was already multicast to p_j. Since the communication system
eventually delivers all messages, we may assume that m' has been received by p_j.
The same reasoning that was applied to m can now be applied to m'. The number
of messages that must be delivered before m is finite and $\rightarrow$ is acyclic, hence this
leads to a contradiction. □

4 Extensions to the basic protocol 

Neither of the protocols in Section 3 is suitable for use in a virtually synchronous setting 
with multiple process groups and dynamically changing group views. This section first 
extends the simple VT CBCAST protocol of Section 3.4.2 into one suitable for use 
with multiple but static process groups, but arrives at a protocol subject to a significant 
constraint on what we call the communication structure of the system. Then, we show 
how to combine the protocol with other mechanisms, notably the LT CBCAST protocol 
of Section 3.4.1, to overcome this limitation. We arrive at a powerful, general solution. 

4.1 Transmission limited to within a single process group 

The first extension to the VT protocol is concerned with processes that multicast only 
within a single process group at a time. This problem is clearly trivial if process groups 
don’t overlap, a property that can be deduced at runtime (see Section 4.4.4). On the other 
hand, we have assumed that overlap will not be uncommon. Such scenarios motivate the 
series of changes to the algorithm presented in this section and the ones that follow. 

The first change is concerned with processes that belong to multiple groups, e.g. a process
p_i belongs to groups g_a and g_b, and multicasts only within groups. Multicasts sent by p_i
to g_a must be distinguished from those to g_b, since a process p_j belonging to g_b and not
to g_a that receives a message with VT(m)[i] = k will otherwise have no way to determine
how many of these k messages were sent to g_b and hence precede m causally. This leads us
to extend the single VT clock to multiple VT clocks; VT_a is the logical clock associated
with group g_a, and VT_a[i] thus counts multicasts by process p_i to group g_a (see
footnote 3). Processes maintain VT clocks for each group in the system, and attach all the
VT clocks to every message that they multicast.

The next change is to step 2 of the VT protocol. Suppose that process p_j receives a message
m sent in group g_a with sender p_i, and that p_j also belongs to groups {g_1, ..., g_n} = G_j.
Step 2 can be replaced by the following rule:

2' On reception of message m from $p_i \neq p_j$, sent in g_a, process p_j delays m until

2.1' $VT_a(m)[i] = VT_a(p_j)[i] + 1$, and

2.2' $\forall k : (p_k \in g_a \wedge k \neq i) : VT_a(m)[k] \leq VT_a(p_j)[k]$, and

2.3' $\forall g : (g \in G_j) : VT_g(m) \leq VT_g(p_j)$.

As above, p_j does not delay messages received from itself.

Footnote 3: Clearly, if p_i is not a member of g_a, then VT_a[i] = 0, thus allowing a sparse
representation of the timestamp. For clarity, we will continue to represent each timestamp
VT_a as a vector of length n, with a special entry * for each process that is not a member
of g_a.
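
Rule 2' can be sketched as a single predicate. The Python below is an illustration only: timestamps are represented sparsely as dictionaries (group name to per-process counts, missing entries meaning 0), and the function name `deliverable` and its parameters are assumptions. The example uses two overlapping groups, G1 = {p1, p2, p3} and G2 = {p2, p3, p4}.

```python
def deliverable(sender, group, msg_vts, local_vts, members, other_groups):
    """Rule 2' at a receiver for a message sent by `sender` in `group`.

    msg_vts and local_vts map group -> {process: count}.
    other_groups plays the role of G_j: the receiver's groups besides `group`.
    """
    vm = msg_vts.get(group, {})
    vp = local_vts.get(group, {})
    # 2.1': m is the next message from the sender within this group.
    if vm.get(sender, 0) != vp.get(sender, 0) + 1:
        return False
    # 2.2': no earlier multicast in this group is still missing.
    for k in members[group]:
        if k != sender and vm.get(k, 0) > vp.get(k, 0):
            return False
    # 2.3': no earlier multicast in the receiver's other groups is missing.
    for g in other_groups:
        vg_m = msg_vts.get(g, {})
        vg_p = local_vts.get(g, {})
        if any(count > vg_p.get(k, 0) for k, count in vg_m.items()):
            return False
    return True


members = {"G1": {"p1", "p2", "p3"}, "G2": {"p2", "p3", "p4"}}
# p1 multicast m1 in G1; p2 delivered it and then multicast m2 in G2.
m2_vts = {"G1": {"p1": 1}, "G2": {"p2": 1}}
nothing_delivered = {"G1": {}, "G2": {}}

at_p3 = deliverable("p2", "G2", m2_vts, nothing_delivered, members, ["G1"])
at_p4 = deliverable("p2", "G2", m2_vts, nothing_delivered, members, [])
at_p3_after_m1 = deliverable("p2", "G2", m2_vts,
                             {"G1": {"p1": 1}, "G2": {}}, members, ["G1"])
```

The receiver p3 belongs to both groups and must wait for m1, whereas p4 ignores the G1 timestamp because it is not a member of G1.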

Figure 2 illustrates the application of this rule in an example with four processes
identified as p_1...p_4. Processes p_1, p_2 and p_3 belong to group G_1, and processes p_2,
p_3 and p_4 to group G_2. Notice that m_2 and m_3 are delayed at p_3, because it is a member
of G_1 and must receive m_1 first. However, m_3 is not delayed at p_4, because p_4 is not a
member of G_1. And m_3 is not delayed at p_2, because p_2 has already received m_1 and it
was the sender of m_2.


Figure 2: Messages sent within process groups. G_1 = {p_1, p_2, p_3} and G_2 = {p_2, p_3, p_4}

The proof of Section 3 adapts without difficulty to this new situation; we omit the nearly 
identical argument. One can understand the modified VT protocol in intuitive terms. By 
ignoring the vector timestamps for certain groups in step 2.3’, we are asserting that there 
is no need to be concerned that any undelivered message from these groups could causally 
precede m. But, the ignored entries correspond to groups to which pj does not belong, 
and it was assumed that all communication is done within groups. 

4.2 Use of partial vector timestamps 

Until the present, we have associated with each message a vector time or vector times 
having a total size determined by the number of processes and groups comprising the 
application. Although such a constraint arises in many published CBCAST protocols,
the resulting vector sizes would rapidly grow to dominate message sizes. A substantial 
reduction in the number of vector timestamps that each process must maintain and trans- 
mit is possible in the case of certain communication patterns, which are defined precisely 
below. Even if communication does not always follow these patterns, our new solution 
can form the basis of other slightly more costly solutions which are also described below. 

Define the communication structure of a system to be an undirected graph CG = (G, E)
where the nodes, G, correspond to process groups and edge (g_1, g_2) belongs to E iff there
exists a process p belonging to both g_1 and g_2. If the graph so obtained has no biconnected
component (see footnote 4) containing more than k nodes, we will say that the communication
structure of the system is k-bounded. In a k-bounded communication structure, the length of
the largest simple cycle is k (see footnote 5). A 0-bounded communication structure is a tree
(we neglect the uninteresting case of a forest). Clearly, such a communication structure is
acyclic.
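
To make the definition concrete, the following Python sketch builds the communication graph from a group membership table and computes its biconnected components with the standard depth-first search (edge stack) algorithm. The function names and the membership-table format are assumptions made for illustration; the text does not prescribe how this information would be obtained in a running system.

```python
from itertools import combinations


def communication_graph(membership):
    """membership maps group -> set of processes; returns an adjacency dict."""
    adj = {g: set() for g in membership}
    for g1, g2 in combinations(membership, 2):
        if membership[g1] & membership[g2]:   # some process belongs to both
            adj[g1].add(g2)
            adj[g2].add(g1)
    return adj


def biconnected_components(adj):
    """Return the node sets of the biconnected components of the graph."""
    index, low, edge_stack, components = {}, {}, [], []
    counter = [0]

    def dfs(u, parent):
        index[u] = low[u] = counter[0]
        counter[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v not in index:
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= index[u]:        # u separates v's subtree
                    component = set()
                    while True:
                        edge = edge_stack.pop()
                        component.update(edge)
                        if edge == (u, v):
                            break
                    components.append(component)
            elif index[v] < index[u]:         # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], index[v])

    for node in adj:
        if node not in index:
            dfs(node, None)
    return components


chain = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}   # A - B - C, acyclic
ring = {"A": {1, 2}, "B": {2, 3}, "C": {3, 1}}    # A, B, C form a cycle
chain_parts = biconnected_components(communication_graph(chain))
ring_parts = biconnected_components(communication_graph(ring))
```

For the acyclic chain every component is a single edge (two groups), while the ring yields one component containing all three groups.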

Notice that causal communication cycles can arise even if CG is acyclic. For example,
in Figure 2, messages m_1, m_2, m_3 and m_4 form a causal cycle spanning both g_1 and g_2.
However, the acyclic structure restricts such communication cycles in a useful way: such
cycles will either be simple cycles of length 2, or complex cycles.

Below, we demonstrate that it is unnecessary to transport all vector timestamps on each
message in the k-bounded case. If a given group is in a biconnected component of size k,
processes in this group need only maintain and transmit timestamps for other groups
in this biconnected component. We can also show that they need to maintain at least
these timestamps. As a consequence, if the communication structure is acyclic, processes
need only maintain the timestamps for the groups to which they belong.

We proceed to the proof of our main result in stages. First we address the special case of 
an acyclic communication structure. 

Lemma 1: If a system has an acyclic communication structure, each process in the system
need only maintain and multicast the VT timestamps of groups to which it belongs.

Notice that under this lemma, the overhead on a message is limited by the size and number 
of groups to which a process belongs. 

We wish to show that if message m_1 is sent (causally) before message m_k, then m_1 will
be delivered before m_k at all overlapping sites. Consider the chain of messages below,

      m1         m2         m3              mk-1        mk
p1 ------> p2 ------> p3 ------> .... ------> pk ------> pk+1
      g1         g2         g3              gk-1        gk

This schema signifies that process p_1 multicasts message m_1 to group g_1, that process
p_2 first receives message m_1 as a member of group g_1 and then multicasts m_2 to g_2,
and so forth. In general, g_i may be the same as g_j for $i \neq j$, and p_i and p_j may be
the same even for $i \neq j$ (in other words, the processes p_i and the groups g_i are not
necessarily all different). Let the term message chain denote such a sequence of messages,
and let the notation $m_i \xrightarrow{p_j} m_j$ mean that p_j transmits m_j using a timestamp
VT(m_j) that directly reflects the transmission of m_i. For example, say that m_i was the
k'th message transmitted by process p_i in group g_a. We will write
$m_i \xrightarrow{p_j} m_j$ iff $VT_a(p_j)[i] \geq k$ and consequently $VT_a(m_j)[i] \geq k$.
Our proof will show that if $m_i \rightarrow m_j$ and the destinations of m_i and m_j overlap,
then $m_i \xrightarrow{p_j} m_j$, where p_j is the sender of m_j.

Footnote 4: Two vertices are in the same biconnected component of a graph if there is a path
between them after any other vertex has been removed.

Footnote 5: The nodes of a simple cycle (other than the starting node) are distinct; a complex
cycle may contain arbitrary repeated nodes.

We now note some simple facts about this message chain that we will use in the proof.
Recall that a multicast to a group g_a can only be performed by a process p_j belonging to
g_a. Also, since the communication structure is acyclic, processes can be members of at
most two groups. Since m_k and m_1 have overlapping destinations, and p_2, the destination
of m_1, is a member of g_1 and of g_2, then g_k, the destination of the final broadcast, is
either g_1 or g_2. Since CG is acyclic, the message chain m_1...m_k simply traverses part of
a tree, reversing itself at one or more distinguished groups. We will denote such a group
g_r. Although causality information is lost as a message chain traverses the tree, we will
show that when the chain reverses itself at some group g_r, the relevant information will
be "recovered" on the way back.

Proof of Lemma 1: The proof is by induction on l, the length of the message chain
m_1...m_k. Recall that we must show that if m_1 and m_k have overlapping destinations, they
will be delivered in causal order at all such destinations, i.e. m_1 will be delivered before
m_k.


Base case. l = 2. Here, causal delivery is trivially achieved, since p_k = p_2 must be a
member of g_1 and m_k will be transmitted with g_1's timestamp. It will therefore be
delivered correctly at any overlapping destinations.

Inductive step. Suppose that our algorithm delivers all pairs of causally related
messages correctly if there is a message chain between them of length l < k. We show
that causality is not violated for message chains where l = k.

Consider a point in the causal chain where it reverses itself. We represent this by
$m_{r-1} \rightarrow m_r \rightarrow m_{r'} \rightarrow m_{r+1}$, where m_{r-1} and m_{r+1} are sent in
g_{r-1} = g_{r+1} by p_{r-1} and p_{r+1} respectively, and m_r and m_{r'} are sent in g_r by
p_r and p_{r'}. Note that p_r and p_{r+1} are members of both groups. This is illustrated in
Figure 3. Now, m_{r'} will not be delivered at p_{r+1} until m_r has been delivered there,
since they are both broadcast in g_r. We now have
$m_{r-1} \xrightarrow{p_r} m_r \xrightarrow{p_{r+1}} m_{r+1}$. We have now established a message
chain between m_1 and m_k where l < k. So, by the induction hypothesis, m_1 will
be delivered before m_k at any overlapping destinations, which is what we set out to
prove. □

We now proceed to prove the main theorem. 
Figure 3: Causal Reversal

Theorem 1: Each process p_i in a system needs only to maintain and multicast the VT
timestamps of groups in the biconnected components of CG to which p_i belongs.

Proof: As with Lemma 1, our proof will focus on the message chain that established
a causal link between the sending of two messages with overlapping destinations. This
sequence may contain simple cycles of length up to k, where k is the size of the largest
biconnected component of CG. Consider the simple cycle illustrated below, contained in
some arbitrary message chain.

      m1                mc         mc+1
p1 ------> . . . p2 ------> p3 ------>
      g1                gc          g1

Now, since p_1, p_2 and p_3 are all in groups in a simple cycle of CG, all the groups are in the
same biconnected component of CG, and all processes on the message chain will maintain
and transmit the timestamps of all the groups. In particular, when m_c arrives at p_3, it
will carry a copy of VT_{g_1} that indicates that m_1 was sent. This means that m_c will not
be delivered at p_3 until m_1 has been delivered there. So m_{c+1} will not be transmitted
by p_3 until m_1 has been delivered there. Thus $m_1 \xrightarrow{p_3} m_{c+1}$. We may repeat
this process for each simple cycle of length greater than 2 in the causal chain, reducing it
to a chain within one group. We now apply Lemma 1, completing the proof. □

Theorem 1 shows us what timestamps are sufficient in order to assure correct delivery of 
messages. Are all these timestamps in fact necessary? It turns out that the answer is yes. 
It is easy to show that if a process that is a member of a group within a biconnected
component of CG does not maintain a VT timestamp for some other group in CG, causality
may be violated. We therefore state without formal proof:

Theorem 2: If a system uses the VT protocol to maintain causality, it is both necessary 
and sufficient for a process pi to maintain and transmit those VT timestamps correspond- 
ing to groups in the biconnected component of CG to which pi belongs. 

4.3 Extensions to arbitrary communication structures 

In general, managing information concerning the biconnected components of CG may be 
difficult, especially in a dynamic environment. We believe that the most practical use of 
the above result is in the acyclic case, since a process can conservatively determine that it 
is not in any cycle by observing that the group of which it is a member overlaps with at 
most one other group - a completely local test (but see also Section 4.4.4). Consequently, 
although all our results generalize, the remainder of the paper focuses on the acyclic 
solution, and we initially implemented only the acyclic solution in Isis. In this section, 
we give two protocols that work in more general communication structures. The first 
protocol does not use any knowledge about the communication structure, but it sometimes 
imposes delays on message multicasting. The second protocol does use knowledge about 
the communication structure, but does not impose delays on message multicasting. We 
then extend both protocols to arbitrary dynamic communication structures. 

4.3.1 Conservative solution 

Our first solution is denoted the conservative protocol. Each multicast m is followed by a
second multicast terminate(m) signifying that m has reached all of its destinations. The
sender of a multicast will normally know when to send the terminate as a side-effect of
the protocol used to overcome packet loss. The terminate message may be sent as a separate
multicast, but it can also be piggybacked on the next CBCAST sent to the same group.
A terminate message is not itself terminated.

We will say that a group g is active for process p if:

1. p is the initiator of a multicast to g that has not terminated, or

2. p has received an unterminated multicast to g, or

3. p has delayed the local delivery of a multicast to g (sent by some other process p').

Note that this is a local property; i.e. process p may compute whether or not it is active 
for some group g by examining its local state. The conservative multicast rule states that 
a process p may multicast to group g iff g is the only active group for process p or p 
has no active groups. Multicasts are sent using the VT protocol, as usual. Notice that 
this rule imposes a delay only when two causally successive messages are sent to different 
groups. The conservative solution yields a correct VT protocol, but it could be inefficient:
the overhead it imposes could be substantial if processes multicast to several
different groups in quick succession, and it is subject to potential starvation (this can,
however, be overcome).
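
The conservative multicast rule depends only on local state, which the following Python sketch tries to make explicit. The class `ConservativeState` and its method names are hypothetical; the sketch tracks unterminated and delayed multicasts per group and applies the rule that a process may multicast to g iff g is its only active group or it has no active groups.

```python
class ConservativeState:
    """Local bookkeeping for the conservative multicast rule (illustrative)."""

    def __init__(self):
        self.unterminated = {}   # group -> ids of multicasts sent or received, not yet terminated
        self.delayed = {}        # group -> ids of received multicasts whose delivery is delayed

    def note_multicast(self, group, msg_id):
        # Called when this process initiates or receives a multicast in `group`.
        self.unterminated.setdefault(group, set()).add(msg_id)

    def note_terminate(self, group, msg_id):
        self.unterminated.get(group, set()).discard(msg_id)

    def note_delayed(self, group, msg_id):
        self.delayed.setdefault(group, set()).add(msg_id)

    def note_delivered(self, group, msg_id):
        self.delayed.get(group, set()).discard(msg_id)

    def active_groups(self):
        return {g for table in (self.unterminated, self.delayed)
                for g, ids in table.items() if ids}

    def may_multicast(self, group):
        # Allowed iff `group` is the only active group, or no group is active.
        return self.active_groups() <= {group}


state = ConservativeState()
state.note_multicast("g1", "m1")
same_group_ok = state.may_multicast("g1")
other_group_blocked = not state.may_multicast("g2")
state.note_terminate("g1", "m1")
other_group_ok_later = state.may_multicast("g2")
```

A multicast to a different group is held back only while an earlier multicast in another group is still unterminated or delayed.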

The conservative solution will work correctly even if group membership changes dynami- 
cally. 

For brevity, we omit the correctness proof of this solution. The key point is that if
p multicasts m to g_2 after g_1 has ceased to be active, then there are no undelivered
multicasts m' in g_1 s.t. $m' \rightarrow m$. This can be demonstrated by showing that if g_1 is no
longer active and $m' \rightarrow m$, then m' has terminated.


4.3.2 Excluded Groups 

Assume that CG contains cycles, but that some mechanism has been used to select a
subset of edges X such that CG' = (G, E - X) is known to be acyclic. We extend our
solution to use the acyclic VT protocol for most communication within groups. If there
is some g' such that $(g, g') \in X$ we will say that group g is an excluded group and some
multicasts to or from g will be done using one of the protocols described below.

Keeping track of excluded groups could be difficult; however it is easy to make pessimistic
estimates (and we will derive a protocol that works correctly with such pessimistic
estimates). For example, in Isis, a process p might assume that it is in an excluded group
if there is more than one other neighboring group. This is a safe assumption; any group 
in a cycle in CG will certainly have two neighboring groups. This subsection and the two 
that follow develop solutions for arbitrary communication structures, assuming that some 
method such as the previous is used to safely identify excluded groups. 
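
The pessimistic estimate can be written as a small test on a membership table. The Python sketch below is illustrative: a real process would use only locally known membership, and the function names and the dictionary format are assumptions.

```python
def neighboring_groups(group, membership):
    """Groups that share at least one process with `group`."""
    return {other for other, procs in membership.items()
            if other != group and procs & membership[group]}


def assume_excluded(group, membership):
    # Pessimistic estimate: more than one neighboring group.
    return len(neighboring_groups(group, membership)) > 1


chain = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}   # acyclic: A - B - C
ring = {"A": {1, 2}, "B": {2, 3}, "C": {3, 1}}    # A, B, C form a cycle
chain_estimate = {g: assume_excluded(g, chain) for g in chain}
ring_estimate = {g: assume_excluded(g, ring) for g in ring}
```

Every group on the cycle is flagged, as required for safety. Group B in the acyclic chain is flagged as well, which shows why the estimate is pessimistic rather than exact.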

4.3.3 Combining the VT and LT protocols

Recall the LT multicast protocol presented in Section 3. The protocol was inefficient, 
but required that only a single timestamp be sent on each message. Here, we run the 
LT and VT protocols simultaneously, piggybacking on each message both LT and VT 
timestamps, and apply a unified version of the LT and VT delivery schemes on receipt. 
The LT timestamp is not incremented on every broadcast; it is only incremented on certain 
broadcasts as described below. This greatly reduces the number of extra messages that 
would be induced by the basic LT algorithm. 

Say that m is to be multicast by p to group g. We say that p is not safe in g if:

• The last message p received was from some other group g'.

• Either g or g' is an excluded group.

Our protocol rule is simple: on sending, if process p is not safe in group g, p will increment
both its LT timestamp and its VT timestamp before multicasting a message to g.
Otherwise, it will just increment its VT timestamp. A message is delivered when it is
deliverable according to both the LT delivery rule and the VT delivery rule.
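
The sending side of the combined rule can be sketched as follows. This Python fragment is a simplification under stated assumptions: the class `CombinedSender` is hypothetical, only the sender's own VT entry per group is tracked, and the delivery side (the conjunction of the LT and VT delivery rules) is omitted.

```python
class CombinedSender:
    """Send-side rule of the combined VT-LT protocol (illustrative sketch)."""

    def __init__(self, excluded_groups):
        self.excluded = set(excluded_groups)
        self.lt = 0
        self.vt = {}             # group -> number of multicasts by this process
        self.last_group = None   # group of the last message received

    def on_receive(self, group, lt):
        self.lt = max(self.lt, lt)
        self.last_group = group

    def not_safe(self, group):
        other = self.last_group
        return (other is not None and other != group
                and (group in self.excluded or other in self.excluded))

    def send(self, group):
        if self.not_safe(group):
            self.lt += 1                       # LT is incremented only when not safe
        self.vt[group] = self.vt.get(group, 0) + 1   # VT is always incremented
        return {"group": group, "lt": self.lt, "vt": dict(self.vt)}


p = CombinedSender(excluded_groups={"g2"})
first = p.send("g1")        # nothing received yet: only VT changes
p.on_receive("g2", 0)       # last message came from excluded group g2
second = p.send("g1")       # not safe in g1: both LT and VT change
p.on_receive("g1", 1)       # last message now came from g1 itself
third = p.send("g1")        # safe again: only VT changes
```

The LT timestamp moves only on the second send, which follows a receipt from a different group where one of the two groups is excluded.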

Notice that the pinging overhead of the LT protocol is incurred only when logical clock 
values actually change, which is to say only on communication within two different groups 
in immediate succession, where one of the groups is excluded. That is, if process p executes 
for a period of time using the VT protocol and receives only messages that leave LT(p) 
unchanged, p will ping each neighbor process at most once. Clocks will rapidly stabilize
at the maximum existing LT value and pinging will then cease. 

Theorem 3: The combined VT-LT protocol will always deliver messages correctly in ar- 
bitrary communication structures. 

Proof: Consider an arbitrary message chain where the first and last messages have
overlapping destinations. Without loss of generality, we will assume that g_1...g_k are distinct.
We wish to show that the last message will be delivered after the first at all such
destinations.

      m1         m2         m3              mk-1        mk
p1 ------> p2 ------> p3 ------> .... ------> pk ------> pk+1
      g1         g2         g3              gk-1        gk

If none of g_1...g_k is an excluded group, then, by Lemma 1, m_1 will be delivered before
m_k at all overlapping destinations. Now, if some group g_i is excluded, two cases arise:
either the last group, g_k, is excluded, or some other group is excluded. If g_k is excluded,
then p_k will increment its LT timestamp at some point between delivering m_{k-1} and
sending m_k. If some other group g_i is excluded, i < k, then p_{i+1} will increment its LT
timestamp between delivering m_i and sending m_{i+1}. So the LT timestamp of m_k will
always be greater than the LT timestamp of m_1, and m_k will be delivered after m_1 at all
overlapping destinations. □

4.4 Dynamic membership changes 

We now consider the issue of dynamic group membership changes when using the combined
protocol. This raises several issues that are addressed in turn: virtually synchronous
addressing when joins occur, initializing VT timestamps, atomicity when failures occur,
and the problem of detecting properties of CG at runtime, such as when a process determines
that its group adjoins at most one other and hence always uses the acyclic VT
protocol.
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4.4.1 Joins 

To achieve virtually synchronous addressing when group membership changes while
multicasts are active, we introduce the notion of flushing the communication in a process
group. Consider a process group g in group view view_i(g). Say that a new view
view_{i+1}(g) now becomes defined. There are two cases: view_{i+1}(g) could reflect the
addition of a new process, or it could reflect the departure (or failure) of a member.
Assume initially that view changes are always due to adding new processes (we handle
failures in Section 4.4.3). We will flush communication by having all the processes in
view_{i+1}(g) send a message "flush i+1 of g" to all other members. During the period after
sending such messages and before receiving such a flush message from all members of
view_{i+1}(g), a process will accept and deliver messages but will not initiate new
multicasts.

Because communication is FIFO, if process p has received a flush message from all
members of g under view i+1, it will first have received any messages sent in view i. It
follows that all communication sent prior to and during the flush event was done using
VT timestamps corresponding to view_i(g), and that all communication subsequent to
installing the new view is sent using VT timestamps for view_{i+1}(g). This establishes that
multicasts will be virtually synchronous in the sense of Section 2.
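
The flush rule can be illustrated with a small simulation. The sketch below is only an
illustration under stated assumptions, not the Isis implementation: the names FifoNetwork,
Member, start_flush and run_join_scenario are hypothetical, channels are modeled as
reliable per-pair FIFO queues, and every process is assumed to learn of the new view
before it consumes any flush message.

```python
from collections import deque


class FifoNetwork:
    """Reliable channels: one FIFO queue per (sender, receiver) pair."""

    def __init__(self):
        self.queues = {}

    def send(self, src, dst, msg):
        self.queues.setdefault((src, dst), deque()).append(msg)

    def run(self, members):
        # Drain every channel, always taking the oldest message first.
        while any(self.queues.values()):
            for (src, dst), queue in list(self.queues.items()):
                if queue:
                    members[dst].receive(src, queue.popleft())


class Member:
    def __init__(self, name, net, view_id=0, view=()):
        self.name, self.net = name, net
        self.view_id, self.view = view_id, list(view)
        self.next_view = None
        self.awaiting = set()   # members whose flush message is still missing
        self.delivered = []     # (view id used by the sender, payload)

    def multicast(self, payload):
        if self.awaiting:       # flush in progress: do not initiate multicasts
            return False
        for q in self.view:
            if q != self.name:
                self.net.send(self.name, q, ("data", self.view_id, payload))
        return True

    def start_flush(self, view_id, view):
        # Send "flush i+1 of g" to all other members of the new view.
        self.next_view = (view_id, list(view))
        self.awaiting = set(view) - {self.name}
        for q in self.awaiting:
            self.net.send(self.name, q, ("flush", view_id, None))

    def receive(self, src, msg):
        kind, view_id, payload = msg
        if kind == "data":      # still accepted and delivered during a flush
            self.delivered.append((view_id, payload))
        else:
            self.awaiting.discard(src)
            if not self.awaiting:   # flushes from all members: install view
                self.view_id, self.view = self.next_view


def run_join_scenario():
    net = FifoNetwork()
    members = {n: Member(n, net, 0, ["a", "b"]) for n in ("a", "b")}
    members["c"] = Member("c", net)            # c is joining
    members["a"].multicast("x")                # sent under view 0
    for m in members.values():
        m.start_flush(1, ["a", "b", "c"])
    blocked = not members["a"].multicast("y")  # refused while flushing
    net.run(members)
    members["a"].multicast("z")                # sent under view 1
    net.run(members)
    return members, blocked
```

In this scenario the message "x", sent by a under view 0, sits ahead of a's flush message
on the FIFO channel to b, so b delivers it before it can install view 1.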

4.4.2 Initializing VT fields 

Say that process p_j is joining group g_a. Then p_j will need to obtain the current VT
values for other group members. Because p_j participates in the flush protocol, this can
be achieved by having each process include its VT value in the flush message. p_j will
initialize VT_a[i] with the value it receives in the flush message from p_i; p_j initializes
VT_a[j] to 0.
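
As a minimal sketch of this initialization, assuming that each flush message carries the
sending member's own VT entry (the text does not say whether the whole vector or a single
entry is sent), the joining process could build its vector as follows. The function name
init_vt_on_join and the dictionary representation are hypothetical.

```python
def init_vt_on_join(joiner, flush_values):
    """Build the joiner's VT vector from values carried in flush messages.

    flush_values maps each existing member to the VT value found in its
    flush message (assumed here to be that member's own entry).
    """
    vt = dict(flush_values)   # VT[i] := value received from p_i
    vt[joiner] = 0            # VT[j] := 0 for the joining process itself
    return vt
```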

4.4.3 Failure atomicity

What about the case where some member of g fails during an execution? view_{i+1}(g) will
now reflect the departure of some process. Assume that process p_j has received a message
m that was multicast by process p_i. If p_i now fails before completing its multicast, there
may be some third process p_k that has not yet received a copy of m. To solve this problem,
p_j must retain a copy of all delivered messages, transmitting a copy of messages initiated
by p_i to other members of view(g) if p_i fails. Processes identify and reject duplicates.

Multicasting now becomes the same two-phase protocol needed to implement the conser- 
vative rule. The terminate message indicates which messages may be discarded; it can 
be sent as a separate message or piggybacked on some other multicast. 

On receiving view_k(g) indicating that p_i failed, p_j runs this protocol:

1. Close the channel to p_i.

2. For any unterminated multicast m initiated by p_i, send a copy of m to all processes
in view_k(g) (duplicates are discarded on reception).

3. Send a flush message to all processes in view_k(g).

4. Simulate receipt of flush and ack messages from p_i as needed by the channel and
view flush protocols, and treat any message being sent to p_i as having been delivered
in the conservative protocol (Section 4.3.1).

5. After receiving flush messages from all processes in view_k(g), discard any messages
delayed pending on a message from p_i.

6. p_j ceases to maintain VT_g[i].

Step 2 ensures atomicity and step 4 prevents deadlock in the VT, LT and the conservative
protocol. Step 5 relates to chains of messages m_1 -> m_2 where a copy of m_2 has been
received but m_1 was lost in a failure; this can only happen if every process that received
m_1 has failed (otherwise a copy of m_1 would have been received prior to receipt of the
flush message). In such a situation, m_2 will never have been deliverable and hence can
be discarded.
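
The bookkeeping behind steps 1-3, 5 and 6 can be sketched as follows. This is a
simplified illustration, not the Isis code: the class Survivor and its methods are
hypothetical names, messages are returned as a list instead of being transmitted, and
step 4 (simulating flush and ack messages from the failed process) is omitted.

```python
class Survivor:
    def __init__(self, name):
        self.name = name
        self.vt = {}             # VT_g: member name -> multicast count
        self.channels = set()    # members we hold an open channel to
        self.retained = {}       # uid -> (initiator, payload), unterminated
        self.seen = set()        # uids already received, to reject duplicates
        self.delivered = []
        self.delayed = []        # (uid, member whose message it waits on)

    def receive(self, uid, initiator, payload):
        if uid in self.seen:
            return False         # duplicate copy: discard
        self.seen.add(uid)
        self.retained[uid] = (initiator, payload)
        self.delivered.append(uid)
        return True

    def terminate(self, uid):
        self.retained.pop(uid, None)   # terminate message: copy may be dropped

    def on_failure(self, failed, view):
        """Steps 1-3: return the messages this process sends, as (dest, msg)."""
        out = []
        self.channels.discard(failed)                          # step 1
        others = [q for q in view if q != self.name]
        for uid, (initiator, payload) in self.retained.items():
            if initiator == failed:                            # step 2
                out += [(q, ("copy", uid, initiator, payload)) for q in others]
        out += [(q, ("flush", self.name)) for q in others]     # step 3
        return out

    def flush_complete(self, failed):
        """Steps 5-6, run once flush messages from the whole view are in."""
        self.delayed = [(u, w) for (u, w) in self.delayed if w != failed]
        self.vt.pop(failed, None)
```

A process that already holds a copy of a retransmitted message rejects it by uid, which
is what makes the unconditional retransmission in step 2 safe.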

This touches on an important issue. Consider a chain of communication that arises 
external to a process group but dependent on a multicast within that group. Earlier, we 
showed that causal delivery is assured by the acyclic VT protocol, but this assumed that 
multicasts would not be lost. Instead, say that processes p_1 and p_2 belong to group g_1 and
that process p_2 also belongs to g_2. p_1 multicasts m_1 to g_1; p_2 receives m_1 and multicasts
m_2 to g_2. Now, if p_1 and p_2 both fail, it may be that m_1 is lost but that m_2 is received
by the members of g_1 ∩ g_2 that are still operational.

Several cases now arise, all troubling. Consider a process q that receives m_2. If q receives
m_2 prior to running the failure protocol, it will discard it under step 5. If q receives
m_2 after running the failure protocol, however, it will have discarded the VT field
corresponding to p_1. m_2 will not be delayed pending receipt of m_1 and hence will ultimately
be delivered, violating causality. (q cannot discard m_2 because it may have been
delivered elsewhere.) We thus see that both causality and atomicity could be violated by an
unfortunate sequence of failures coincident with a particular pattern of communication,
and that the system will be unable to detect that this has occurred.

One way to avoid this problem is to require that processes always use the conservative 
rule of Section 4.3.1, even if the communication structure is known to be acyclic. In our 
example, this would prevent p_2 from communicating in g_2 until m_1 reached its destinations.
Recall that step 4 of the protocol given above prevents the conservative rule from blocking 
when failures occur. 

An alternative is to accept some risk and operate the system unsafely. For example, 
a process might be permitted to initiate a multicast to group g only if all of its own
multicasts to other groups have been delivered to at least one other destination process;
this yields a protocol tolerant of any single failure (see footnote 6).
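
The sending rule for this single-failure-tolerant variant can be sketched as a simple
check. The class OneResilientSender and its method names are hypothetical, and the
sketch assumes the sender learns through some acknowledgement when another destination
holds a copy of one of its multicasts.

```python
class OneResilientSender:
    def __init__(self):
        # uid -> [group, number of other processes known to hold a copy]
        self.sent = {}

    def record_multicast(self, uid, group):
        self.sent[uid] = [group, 0]

    def record_delivery(self, uid):
        # Called when some other destination is known to have received uid.
        self.sent[uid][1] += 1

    def may_multicast(self, group):
        # Allowed only if every multicast to a *different* group has reached
        # at least one other destination process.
        return all(copies >= 1
                   for g, copies in self.sent.values() if g != group)
```

For example, after a multicast to g1 that nobody has yet acknowledged, the sender may
continue to multicast in g1 but must wait before multicasting in g2.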

Given a 1-resilient protocol, the sequence of events that could cause causal delivery to be 
violated seems quite unlikely. A k-resilient protocol can be built by also delaying receivers;
for large k, this reverts to the conservative approach. 

We believe that even for a 1-resilient protocol, the scenario in question (two failures that 
occur in sequence simultaneously with a particular pattern of communication) is extremely 
improbable. The odds of such a sequence occurring are probably outweighed by the risk of
a software bug or hardware problem that would cause causality to be violated for some 
mundane reason, like corruption of a timestamp or data structure. 

Our initial implementation of bypass CBCAST uses the conservative solution between all 
groups; i.e. all groups are excluded. The VT protocol is used for communication within a 
group. This version of Isis is thus immune to the causality and atomicity problems cited 
above, but incurs a high overhead if processes multicast to a series of groups in quick 
succession, which is not uncommon. Our plan is to modify the implementation to use 
the more optimistic protocols in a 1-resilient manner, but to provide application designers
with a way to force the system into a completely safe mode of operation if desired. It 
should be noted that limitations such as this are common in distributed systems; a review 
of such problems is included in [BJ89]. We are not alone in advocating a “safe enough” 
solution in order to increase performance. 

4.4.4 Dynamic communication graphs 

A minor problem arises in applications having the following special structure: 

1. The combined VT-LT protocol is in use. 

2. Processes may leave groups other than because of failures (in Isis, this is uncommon 
but possible). 

3. Such a process may later join other groups. 

Earlier, it was suggested that a process might observe that the (single) group to which 
it belongs is adjacent to just one other group, and conclude that it cannot be part of a 
cycle. In this class of applications, this rule may fail. 

To see this, suppose that a process p belongs to group g_1, then leaves g_1 and joins g_2. If
there was no period during which p belonged to both g_1 and g_2, p would use the acyclic
VT protocol for all communication in both g_1 and g_2. Yet, it is clear that p represents a
path by which messages sent in g_2 could be causally dependent upon messages p received
in g_1, leading to a cyclic message chain that traverses g_1 and g_2. This creates a
condition under which violations of the causal delivery ordering could result.

6 When using a transport facility that exploits physical multicast, such a message will most
often have reached all of its destinations.

This problem can be overcome in the following manner. Associate with each group a
counter of the number of other groups to which it has ever been adjacent; this requires
only a trivial extension of the flush protocol. Moreover, say that even after a process p
leaves a group g, it reports itself as a one-time member of g. If p joins some group g',
the adjacency count for g' will now reflect its prior membership, and if a causal cycle
could possibly arise, multicasts will be under the exclusion rule. Clearly, this solution
is conservative and could be costly. On the other hand, say that it is known that
multicasts terminate within some time delay σ. Then one could decrement the adjacency
count for a group after a delay of σ time units without risk. In Isis, a reasonable value
of σ would be on the order of 2-3 seconds.

We have developed more sophisticated solutions to this problem, but omit these because
the issue only arises in a small class of applications, and the methods and their proofs
are complex.


4.4.5 Recap of the extended protocol 

In presenting our algorithm as a basic scheme to which a series of extensions and
modifications can be added, we may have obscured the overall picture. We conclude the
section with a brief summary of the protocol as we intend to use it in Isis.

The protocol we ultimately plan to use in Isis is the acyclic VT solution combined with the
LT protocol. This protocol piggybacks an LT timestamp and a list of VT timestamps on
each message, one VT timestamp for each group to which the sender of the message belongs.
In addition to the code for delaying messages upon reception, the protocol implements
the channel- and view-flush and terminate algorithms.

Under most conditions the Isis system will be operated conservatively, excluding groups
adjacent to more than one neighboring group. As noted above, neighboring groups can
be counted by piggybacking information on the view-flush protocol. Looking to the
future, we expect to develop Isis subsystems that will have special a-priori knowledge
of the communication structure. These subsystems will make use of an Isis system call
[...] to indicate the exclusion status of groups. We currently [...] develop
sophisticated communication topology algorithms for Isis.

The current implementation consists of the VT scheme and the conservative rule,
together with the flush and terminate protocols. We expect to add the LT extension
shortly; the necessary code is small compared to what is already running.

5 Other communication requirements

In this section we consider some minor extensions of the protocol for other common 
communication requirements. 

5.1 A Bypass ABCAST protocol 

Readers may wonder if the bypass CBCAST protocol can be extended into a fast AB- 
CAST mechanism. ABCAST is a totally ordered communication protocol: all destina- 
tions receive an ABCAST message in a single, globally fixed order. 

The answer to this question depends on the semantics one associates with ABCAST 
addressing. One way to define ABCAST is to say that two ABCAST’s to the same 
logical address will be totally ordered, but to make no guarantees about ordering for 
ABCAST messages sent to different addresses. A more powerful alternative is to say 
that regardless of the destination processes, if two ABCAST’s overlap at some set of 
destinations, they are delivered in the same order. Although Isis currently supports the 
latter approach, it is far easier to implement a bypass ABCAST with the weaker delivery 
semantics; the resulting protocol resembles the one in [CM84]. This is in contrast with 
bypass CBCAST, which always achieves causal ordering. 

Associated with each view view_i(g) of a process group g will be a token holder process,
token(g) ∈ view_i(g). If the holder fails, the token is automatically reassigned to a live
group member using any well-known, deterministic rule. Assume that each message m is 
uniquely identified by uid(m). 

To ABCAST m, a process holding the token uses CBCAST to transmit m. If the 
sender is not holding the token, the ABCAST is done in stages: 

1. The sender CBCAST’s a needs-order message containing m (see footnote 7). Processes
receiving this message delay delivery of m.

2. If a process holding the token receives a needs-order message, it CBCAST’s a
sets-order message giving a list of one or more messages, identified by uid, and
the order in which to deliver them, which it may choose arbitrarily. If desired, a new
token holder may also be specified in this message.

3. On receipt of a sets-order, a process notes the new token holder and delivers 
delayed messages in the specified order. 

4. On detection of the failure of the token holder, after completing the flush protocol, 
all processes sort pending ABCAST’s and deliver them in any consistent order. 

7 It might appear cheaper to forward such a message directly to the token holder. However,
for moderately large messages such a solution will double the IO done by the token holder,
creating a likely bottleneck, while reducing the IO load on other destinations only to a
minor degree.

This protocol is essentially identical to the replicated data protocol proved correct in
[BJ89,Sch88]. Step 4 is correct because the flush ensures that any sets-order messages
will have been delivered atomically, hence all processes will have the same enqueued
messages, which they deliver immediately before installing the new view.
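
The needs-order / sets-order exchange for senders that do not hold the token can be
sketched as below. This is an illustration under stated assumptions, not the Isis code:
the class and method names are hypothetical, token movement and the token holder's own
direct multicasts are left out, and causal delivery by CBCAST is stood in for by holding
a sets-order entry until the message it names has arrived.

```python
class AbcastMember:
    def __init__(self, name):
        self.name = name
        self.pending = {}     # uid -> payload, received but not yet ordered
        self.order = []       # uids in the order fixed by the token holder
        self.delivered = []   # payloads in delivery order

    def on_needs_order(self, uid, payload):
        # Step 1: delay delivery until the token holder fixes the order.
        self.pending[uid] = payload
        self._drain()

    def on_sets_order(self, uids):
        # Step 3: deliver delayed messages in the specified order.
        self.order.extend(uids)
        self._drain()

    def _drain(self):
        while self.order and self.order[0] in self.pending:
            uid = self.order.pop(0)
            self.delivered.append(self.pending.pop(uid))


class TokenHolder(AbcastMember):
    def __init__(self, name):
        super().__init__(name)
        self.unordered = []   # uids seen in needs-order, not yet ordered

    def on_needs_order(self, uid, payload):
        super().on_needs_order(uid, payload)
        self.unordered.append(uid)

    def make_sets_order(self):
        # Step 2: choose an order (here, arrival order at the holder).
        uids, self.unordered = self.unordered, []
        return uids
```

Two members that receive the same needs-order messages in different orders still deliver
them identically, because only the order listed by the token holder is used.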

The cost of doing a bypass ABCAST depends on the locations where multicasts originate 
and frequency with which the token is moved. If multicasts tend to originate at the same 
process repeatedly, then once the token is moved to that site, the cost is one CBCAST 
per ABCAST. If they originate randomly and the token is not moved, the cost is 1 + 1/k
CBCAST’s per ABCAST, where we assume that one sets-order message is sent for
ordering purposes once for every k ABCAST’s. This represents a major improvement 
over the existing Isis ABCAST protocol. However, because bypass ABCAST achieves 
a weaker form of ordering, it might require changes to existing Isis applications. We have 
not yet decided whether to make it the default. 
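
As a worked example of the cost figures above, the small helper below (a hypothetical
function, not part of Isis) evaluates the number of CBCAST's per ABCAST for a given
batching factor k.

```python
def cbcasts_per_abcast(k, sender_holds_token=False):
    """Average CBCASTs per ABCAST when one sets-order covers k ABCASTs."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if sender_holds_token:
        return 1.0            # the message itself is the only CBCAST
    return 1.0 + 1.0 / k      # the message plus a 1/k share of a sets-order
```

With k = 1 every ABCAST costs two CBCAST's; with k = 4 the average drops to 1.25.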

5.2 Point-to-point messages 

Early in the paper, we asserted that asynchronous CBCAST is the dominant protocol
used in Isis. Point-to-point messages, arising from replies to multicast requests and
RPC interactions, are also common. In both cases, causal delivery is desired. Here, we 
consider the case of point-to-point messages sent by a process p within a group G to which 
p belongs. 

A straightforward way to incorporate point-to-point messages into our VT protocol is to
require that they be acknowledged and to inhibit the sending of new multicasts during
the period between when such a message is transmitted and when the acknowledgement is
received (in the case of an RPC, the reply is the acknowledgement). The recipient is not
inhibited, and need not keep a copy of the message. A point-to-point message is
timestamped using the sender’s logical and vector times, and delivered using the
corresponding delivery algorithms, but neither timestamp is incremented prior to
transmission. In effect, point-to-point messages are treated as events internal to the
processes involved.

The argument in favor of this method is that a single point-to-point RPC is fast and 
the cost is unaffected by the size of the system. Although one can devise more complex 
methods that eliminate the period of inhibited multicasting, problems of fault-tolerance
render them less desirable. 
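
The inhibition rule can be sketched as follows. The class InhibitingSender and its
methods are hypothetical, and the sketch assumes, as in the basic VT protocol, that a
multicast increments the sender's own VT entry while a point-to-point message carries
the current timestamps unchanged.

```python
class InhibitingSender:
    def __init__(self, name, members):
        self.name = name
        self.lt = 0                            # logical time
        self.vt = {m: 0 for m in members}      # vector time for the group
        self.unacked = set()                   # point-to-point uids in flight

    def send_point_to_point(self, uid):
        # Timestamped with the current LT and VT; neither is incremented.
        self.unacked.add(uid)
        return {"uid": uid, "lt": self.lt, "vt": dict(self.vt)}

    def on_ack(self, uid):
        # For an RPC, the reply plays the role of the acknowledgement.
        self.unacked.discard(uid)

    def multicast(self, uid):
        if self.unacked:
            return None                        # inhibited until acknowledged
        self.vt[self.name] += 1
        return {"uid": uid, "lt": self.lt, "vt": dict(self.vt)}
```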

5.3 Subset multicasts 

Some Isis applications form large process groups but require the ability to multicast to 
subsets of the total membership. Our protocol is easily extended into one supporting 
subset multicast, and our initial Isis implementation supports this as an option. When 
enabled, a VT vector timestamp of length sn is needed for a group with s senders and n
members.

For example, a stock brokerage might support a quote dissemination service with two or
three transmitters and hundreds of potential recipients. Rather than form a subgroup
for each stock (a costly approach if there are many stocks), each multicast could be sent
to exactly those group members interested in a given quote. We omit the details of the
subset multicast extension.


6 Performance and transport protocol selection 

In this section, we discuss the performance of our protocol. We show that the performance 
of the bypass protocol will be largely dominated by the performance of the underlying 
layer that is simply concerned with moving data from one site to others. We discuss the 
design of some alternatives for this layer, which we are currently implementing. 

6.1 Complexity and overhead of the protocol 

Implementation of the bypass protocol was straightforward in Isis, requiring less than 1300 
lines of code out of the total of 52,000 in the protocol layer of the system. Extensions 
to support the LT protocol will add little additional code. Initial measurements of 
performance demonstrate a five to tenfold speedup over the prior Isis protocols. 

Our protocol has an overhead of both space and messages transmitted. The size of a 
message will be increased by the vector time fields it carries; as noted above, the number 
of such vectors is determined by the total cardinality of the groups to which the sender 
belongs directly, and hence will be small. The number of overhead messages sent will 
depend on the number of non-piggybacked terminate messages sent by the conservative
protocol and, when implemented, the frequency of LT pinging. In Isis, LT pinging is
expected to be rare and terminate messages are always piggybacked on a subsequent
CBCAST unless communication in a group quiesces. (As noted before, LT overhead can
be bounded using a periodic protocol, if necessary.)

We believe that latency, especially when the sender of a multicast must delay before 
continuing computation, is the most critical and yet unappreciated form of overhead. 
Delays of this form are extremely noticeable. In many systems, there is only one active
computation at a given instant in time, or a single computation that holds a lock or other 
critical resource. Delaying the sender of a multicast may thus have the effect of shutting 
down the entire system. In contrast, the delay between when a message is sent and
when it reaches a remote destination is less relevant to performance. The sender may 
be delayed in two ways: if the transmission protocol itself is computationally costly, or 
if a self-addressed multicast cannot be delivered promptly because it is unsafe to do so. 
Defined in this sense, our method imposes latency on the sender of a multicast only in the 
conservative protocol, and only when a process switches from multicasting in one group 
to another, or needs to communicate in one group after receiving in another. Otherwise, 
the protocol is totally asynchronous. Latency on the transport side is less critical. The
dominant source of transport latency is LT pinging, and we plan to quantify this effect
by instrumenting Isis and using simulations. 

6.2 Implementation 

An interesting feature of the bypass facility is that it assumes very little about communi- 
cation between processes, and communicates in an extremely regular manner. Specifically, 
the protocol we ended with sends or multicasts only within groups to which a sending 
process belongs, and requires only that inter-process communication be sequenced and 
lossless. The idea of providing an interface by which the bypass multicast protocols could 
run over a lower-layer protocol provided by the application appealed to us, and as part of 
the Isis implementation of bypass CBCAST and ABCAST, we included an interface 
permitting this type of extension. We call this lower layer the multicast transport protocol. 
A multicast transport protocol simply delivers messages reliably, in FIFO order, to the 
groups or processes addressed. 

When no special hardware for multicasting is available, the basic Isis multicast transport
protocol is based on UDP (unreliable datagrams). When multicasting hardware is available,
Isis can switch to an experimental multicast transport protocol that takes advantage
of such hardware. The remainder of this section details the design, performance and over- 
head of these multicast transport protocols (in time, size, and messages exchanged per 
multicast). 

6.3 Overhead imposed by the basic VT Protocol 

This section breaks down the costs we see in terms of various components of the overhead 
(create a lightweight task, do the I/O, select system call, create the packets, reconstruct
them on reception). Figure 4 breaks down the basic CPU costs of sending and receiving 
messages in our implementation. These figures are preliminary and will be revised. These 
figures are for the combined protocol, but they do not reflect higher level delays that 
might be imposed by infrequent events such as LT pinging or the view flush. Our figures 
were derived on a pair of SUN 3/60’s doing continuous null RPC’s from one to the other. 
The RPC request was sent in a CBCAST; the result returned in a CBCAST reply 
packet. A new lightweight task was created at the receiver to field each RPC request. 
An Isis message is fairly complex and allows scatter/gather and arbitrary user-defined
and system-checked types. Since no attempt has been made to optimize message data
structures for the simple case of a null RPC, this accounts for a large part of the time
spent in the messaging/task layer of the system. 

The main conclusion from these measurements is that the CBCAST algorithms we derive 
in this paper are quite inexpensive. Most of the time that a message spends in transit is 
spent in the lower layers of the system. Clearly, the cost of UNIX messaging is beyond 
our control, but a great deal can be said about multicast transport.

Figure 4: Basic protocol overhead

6.4 Multicast transport protocol selection 

The basic Isis multicast transport protocol is designed around a point-to-point model. 
Each process in a group maintains a two-way reliable data stream with each other process 
in the group. Whenever possible, acknowledgement information is piggybacked on other 
packets, such as replies to an RPC or multicast. These streams are maintained indepen- 
dently of each other; for brevity, we omit discussion of such details as flow control and 
failure detection. This scheme has several advantages; it is relatively easy to understand, 
as it is based on a well-known communication model. Since it is built on top of unreliable 
datagrams, it can be easily implemented on any network that provides this service. It has, 
however, several disadvantages - in particular, it does not scale well. The processing and 
network transmission costs of communicating with a group rise linearly with the number 
of processors in the group. In addition, as the number of processes in a group increases, a 
process sending to the group may experience congestion at the network interface as many 
acknowledgement or reply packets arrive more or less simultaneously from the other
processes in the group. 

We have therefore investigated the design of other multicast transport protocols. An ideal 
multicast transport protocol would have the following features: 

• It would be independent of network topology, but able to take advantage of features 
of particular networks - e.g. a broadcast subnet. 

• The cost of sending a message would be independent of the number of recipients of
that message.

• It would work efficiently for both small and large messages.

• It would have low overhead and latency, and high throughput.

It is also important to note that frequently a multicast may give rise to many replies 
directed to the original sender. We call such an occurrence a convergecast. This can 
lead to congestion at the original multicast sender, with many of the replies being lost. 
To avoid this, a multicast transport protocol should have some sort of mechanism for
co-ordinating and reliably delivering multicast replies. Similar considerations may apply 
to acknowledgements; however acknowledgements need not be as timely as replies - the 
multicast transport protocol has more freedom to delay them. 

Generally speaking, a reliable multicast transport mechanism will be used in two distinct
modes. In the first, stream mode, one process will multicast a large amount of data to the 
group before another process wishes to reply. Multicasting is continuous. This usage could 
arise in, for example, a trading system, where the transport mechanism is being used to 
disseminate quotes to trading stations. Another example is a replicated file system where
a client workstation is writing a file to a group of file servers. In rpc mode, many processes
multicast replicated rpc’s to the group, where each rpc contains relatively little data, and 
is much more likely to actually require a reply. Multicasts are not continuous, but bursty. 
This could arise in maintaining and querying a distributed database or maintaining the 
state of a distributed game. Note that the application using the multicast transport 
protocol can provide hints as to which mode it thinks it is operating in. Intermediate 
modes of usage can of course arise; we do not expect them to be common. 

Reliable multicast transport protocols may be divided into two classes; those based on 
positive acknowledgements, and those based on negative acknowledgements. Many previous
proposals for reliable multicast transport protocols have been based on negative
acknowledgements, including [KTHB89,AHL89,CM84]. (Some of these protocols, in addition
to providing reliable transport, also provide transport ordering properties.) This is
because the designers of these protocols believed that a positive acknowledgement from 
each receiving site would be expensive. We do not believe that this is so. 

If a process group is largely communicating in rpc mode, reply messages will be converging 
at the sender in any case. These reply messages can carry positive acknowledgements. In 
addition, if there are many of these reply messages, they should be scheduled by some 
mechanism to avoid congestion and message loss at the multicast sender. On the other 
hand, if a group is largely communicating in stream mode, the issue of flow control becomes
very important. The sender can’t send data faster than the slowest process in the group 
can receive it; in order to avoid packet loss, there will be flow control packets coming 
back to the sender from each other process in the group. Again, these packets may carry 
positive acknowledgments, and again, they must be scheduled in order to avoid congestion 
problems. The protocol has more flexibility in scheduling these packets than in scheduling 
reply packets, since they do not contain data that needs to be delivered to the higher level.

There are several possible mechanisms for scheduling packets that are converging on the
same destination. One scheme is for the original sender to schedule the packets; it will 
decide how many concurrent acknowledgments or replies it (and the network) can handle. 
It then schedules each group of acknowledgements. This scheme involves some extra work 
by the sender; it has the advantage that the sender can control the rate at which the 
packets come back depending on whether or not his client is waiting for replies. 
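
A minimal sketch of sender scheduling, assuming the sender simply partitions the
receivers into rounds no larger than the number of concurrent replies it can handle. The
function name and the fixed round size are hypothetical; the text leaves the actual
policy open.

```python
def schedule_acknowledgements(receivers, max_concurrent):
    """Split receivers into rounds of at most max_concurrent members each."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return [receivers[i:i + max_concurrent]
            for i in range(0, len(receivers), max_concurrent)]
```

The sender would then invite one round at a time, and could shrink or grow the round
size depending on whether its client is waiting for replies.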

Other methods involve the receivers co-operating to ensure that they don’t send too many 
packets to the sender at once. One such method basically involves passing one or several 
tokens around the group, with the holder of a token having the right to send reply or 
acknowledgement packets to the original sender. If the replies or acknowledgements are 
small, they can be put on the token itself, which is returned to the sender when it is full. 
The main problem with this scheme is that the acknowledgement or reply may take a long 
time to return to the original sender of a message. This can be overcome by using large 
window sizes, or by using a large enough number of tokens. Another problem is that the 
overhead of receiving a message is higher, because an acknowledgement token must be 
received and transmitted also. This can be overcome by having one token acknowledge 
several messages, and by piggybacking the acknowledgement token wherever possible. A 
third problem is that the loss of one acknowledgement packet may cause a message to be 
retransmitted to multiple destinations. We believe that the extra overhead is acceptable, 
since packet loss should be rare. 

Another receiver-scheduled method for handling acknowledgements or replies is simply 
to have each acknowledgement be returned at some random time by the recipients. This 
scheme has been extensively analyzed by [Dan89]; the main problem is that in order to 
avoid congestion at the original sender, the interval from which the random delays must be 
picked is very long. It is also of course possible to combine several of the above schemes; 
for example, acknowledgements could be sender-scheduled in small groups; individual
acknowledgements within each group could be further randomly delayed. 
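
The trade-off in the random-delay scheme can be explored with a toy model. The sketch
below is not the analysis of [Dan89]; it is a hypothetical illustration that draws one
uniform random delay per receiver and reports the largest number of acknowledgements
falling into any single time slot at the sender. The function name, the slot model and
the seeded generator are all assumptions.

```python
import random


def peak_arrivals(n_receivers, interval, slot, seed=0):
    """Largest number of acknowledgements landing in one slot of width `slot`.

    Each receiver delays its acknowledgement by a time drawn uniformly from
    [0, interval]. A seeded generator keeps the experiment repeatable.
    """
    if n_receivers < 1 or slot <= 0:
        raise ValueError("need at least one receiver and a positive slot")
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_receivers):
        delay = rng.uniform(0.0, interval)
        index = int(delay // slot)
        counts[index] = counts.get(index, 0) + 1
    return max(counts.values())
```

With an interval of zero all acknowledgements collide in the same slot; lengthening the
interval lowers the peak, at the price of later acknowledgements.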

We are implementing multicast transport protocols with several of the convergecast- 
avoidance scheduling strategies described above, and will experiment with them as al- 
ternatives to the basic ISIS multicast transport protocol. Our implementations are based 
on the multicast UDP software of [Dee88], which provides a logical unreliable multicast 
across internets independently of whether the underlying networks support physical mul- 
ticast. Full details of the design and implementation of these protocols will be found 
in [Ste90]. We will include performance measurements for the bypass CBCAST and 
ABCAST protocols running over these transport protocols in the final version of the 
paper.

7 Related Work


There has been a great deal of work on multicast primitives. CBCAST-like primitives 
are described in [BJ87,PBS89,VRB89,SES89,LL86]. As noted earlier, our work is most
closely related to that of Ladkin and Peterson. Both of these efforts stopped at essentially 
the point we reached in Section 3, arriving at protocols that would perform well within a
single small group, but subject to severe drawbacks in systems with large numbers of pro- 
cesses and of overlapping, dynamically changing process groups. Pragmatic considerations 
stemming from our desire to use the protocol in ISIS motivated us to take our protocol 
considerably further. We believe the resulting work to be interesting from a theoretical 
perspective. Viewed from a practical perspective, a causal multicast protocol that scales 
well and imposes little overhead under typical conditions certainly represents a valuable 
advance. 

ABCAST-like primitives are reported in [CM84,BJ87,GMS89,PGM85]. Our ABCAST 
protocol is motivated by the Chang-Maxemchuk solution [CM84], but is simpler and 
faster because it can be expressed in terms of a virtually synchronous bypass CBCAST. 
In particular, our protocol avoids the potentially lengthy delays required by the 
Chang-Maxemchuk approach prior to committing a message delivery ordering. We believe this 
argues strongly for a separation of concerns: in particular, a decoupling of process group 
management from the communication primitive itself. 

We note that of the many protocols described in the literature, very few have been imple- 
mented, and many have potentially unbounded overhead or postulate knowledge about 
the system communication structure that might be complex to deduce. This makes direct 
performance comparisons difficult, since many published protocols give performance esti- 
mates based on simulations or measure dedicated implementations on bare hardware. We 
are confident that the Isis bypass communication suite gives performance fully competi- 
tive with any alternative. The ability to extend the transport layer will enable the system 
to remain competitive even in settings with novel architectures or special communication 
hardware. 

The ability to run the bypass protocols over new transport protocols raises questions for 
future investigation. For example, one might run bypass CBCAST over a transport 
layer with known realtime properties. Depending on the nature of these properties, such 
a composed protocol could satisfy both sets of properties simultaneously, or could favor 
one over the other. For example, the delay of flushing channels suggests that realtime 
and virtual synchrony properties are fundamentally incompatible, but this still leaves 
open the possibility of supporting a choice between weakening the realtime guarantees 
to ensure that the system will be virtually synchronous and weakening virtual synchrony 
to ensure that realtime deadlines are always respected. For many applications, such a 
choice could lead to an extremely effective, tuned solution. Pursuing this idea, we see 
the Isis system gradually evolving into a more modular structure composed of separable 
facilities for group view management, enforcing causality, transporting data, and so forth. 
For a particular setting, one would select just those facilities actually needed. Such a 
compositional programming style has been advocated by others, notably Larry Peterson 
in his research on the Psync system. 

8 Conclusions 

We have presented a new scheme, the bypass protocol, for efficiently implementing a re- 
liable, causally ordered multicast primitive. Intended for use in the Isis toolkit, it offers 
a way to bypass the most costly aspects of Isis while benefiting from virtual synchrony. 
The bypass protocol is inexpensive, yields high performance, and scales well. Measured 
speedups of more than an order of magnitude were obtained when the protocol was 
implemented within Isis. Our conclusion is that systems such as Isis can achieve 
performance competitive with the best existing multicast facilities, a finding contradicting 
the widespread concern that fault tolerance may be unacceptably costly. 